Cdf and Pdf Functions
PDF(X) and CDF(X)
PDF generates a histogram or probability density function for «X», where «X» is a sample of data. CDF generates a cumulative distribution function for «X». They are similar to the methods used to generate the uncertainty views PDF and CDF for uncertain quantities. But, as functions, they return results as arrays available for further processing, display, or export. They can also work with data with indexes other than Run, the default index for uncertain samples. If «X» contains a sample from a discrete distribution, PDF returns a probability mass function (histogram) rather than a density function. Similarly, CDF returns a cumulative mass function rather than a cumulative distribution function.
PDF and CDF have one required parameter, «X», the sample data points, indexed by I. The functions also accept several optional parameters, described below, with the following syntax:
PDF(x: [I]; I: IndexType = Run; w: NonNegative[I] = SampleWeighting; discrete: Optional Boolean; binMethod, samplesPerStep: Optional Positive; domain: Unevaluated = x)
CDF(x: [I]; I: IndexType = Run; w: NonNegative[I] = SampleWeighting; discrete: Optional Boolean; binMethod, samplesPerStep: Optional Positive; domain: Unevaluated = x)
Examples
A common use is to generate the PDF or CDF table of an uncertain variable «X», generated as a random sample, e.g.:
PDF(X)
CDF(X)
Here the distribution, «X», should be uncertain -- i.e. a sample indexed by Run, usually generated from a probability distribution.
Continuous or discrete?
Usually, PDF and CDF determine automatically whether «X» is discrete or continuous. They assume «X» is discrete if it contains text values or numerical values with many repetitions, and continuous if it contains only numbers with few or no repetitions. You can override that assumption by specifying the optional parameter discrete: True or discrete: False.
If the distribution is continuous, the result is indexed by Step and DensityIndex. DensityIndex has the elements 'X' and 'Y', where 'Y' contains the probability density (or cumulative probability for CDF). If it is discrete, the result contains the probability mass (or cumulative probability for CDF) indexed by PossibleValues.
Histograms
You can also use PDF and CDF to generate histograms of data that is not uncertain, i.e. indexed by something other than Run. For example, to generate a histogram of Y over index J, use:
PDF(Y, J)
Optional parameters
I
The index over which the functions generate the histogram. By default this is Run -- i.e. a Monte Carlo sample -- but you can also specify another index to generate a histogram over another dimension.
W
The sample weights. Can be used to weight each sample point differently. Defaults to the system variable SampleWeighting.
Discrete
Set to True or False to force discrete or continuous treatment.
SpacingMethod
Selects the histogramming method. If omitted, the system default set in the Uncertainty Setup dialog from the Result menu is used. Options are:
0 = "equal-X": Equal steps along the «X» axis (values of «X»).
1 = "equal-sample-P": Equal numbers of sample values in each step.
2 = "equal-weighted-P": Equal sum of weights of samples, weighted by «w».
SamplesPerStep
An integer specifying the number of samples per bin. If omitted, the system default set in the Uncertainty Setup dialog from the Result menu is used.
Domain
Name of a variable whose Domain attribute should be used (see below).
SmoothingMethod
0 = "Histogram": Shows the PDF as a histogram.
1 = "KDE": Uses Kernel Density Estimation to generate a smooth curve, with «smoothingFactor» (below) ranging from -1 to 1 (default 0).
2 = "KDE": Uses Kernel Density Estimation to generate a smooth curve, with «smoothingFactor» (below) treated as a global bandwidth in the same units as «x».
SmoothingFactor
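When «smoothingMethod» is 1 (KDE), this factor specifies the degree of smoothness, from -1 (maximal detail) to +1 (maximal smoothing). Usually the best value is 0, which is the default. When «smoothingMethod» is 2, the factor is instead treated as a global bandwidth in the same units as «x».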
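To make the global-bandwidth form of kernel density estimation concrete, here is a minimal Python sketch. It assumes a Gaussian kernel, which this page does not confirm as the kernel Analytica uses. The function name `gaussian_kde` and its parameters are hypothetical and only illustrate how a bandwidth in the units of «x» controls smoothness.

```python
import math

def gaussian_kde(sample, bandwidth, grid):
    """Density estimate at each grid point using a Gaussian kernel.

    Assumption: a Gaussian kernel with one global bandwidth, expressed in
    the same units as the sample. The real kernel may differ.
    """
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in sample)
        for g in grid
    ]

sample = [1.0, 2.0, 2.5, 4.0]
grid = [i * 0.5 for i in range(11)]  # 0.0, 0.5, ..., 5.0
density = gaussian_kde(sample, bandwidth=0.5, grid=grid)
for g, d in zip(grid, density):
    print(f"{g:4.1f}  {d:.4f}")
```

A larger bandwidth spreads each point's contribution over a wider range and gives a smoother curve. A smaller one preserves more detail.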
Exceedance
New to Analytica 5.4
When «exceedance» is set to true in a call to CDF, the function returns the exceedance curve instead of the cumulative probability. The exceedance curve is one minus the CDF curve (i.e., the CDF curve flipped vertically), and gives the probability that the true outcome exceeds the given level. Some fields of study prefer exceedance curves to CDF curves.
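The relationship between the two curves can be illustrated with a short Python sketch. It uses a plain empirical CDF over sorted sample points, which is an assumption made for illustration and not necessarily how Analytica computes its CDF. The function names are hypothetical.

```python
def empirical_cdf(sample):
    """Pairs (x, fraction of points <= x) for each sorted sample value."""
    xs = sorted(sample)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def exceedance(cdf_points):
    """Flip a CDF vertically: probability that the outcome exceeds x."""
    return [(x, 1.0 - p) for x, p in cdf_points]

cdf = empirical_cdf([3, 1, 4, 2])
for (x, p), (_, q) in zip(cdf, exceedance(cdf)):
    print(x, p, q)
```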
Is the distribution discrete or continuous?
PDF(X) generates a probability mass function or density function according to whether it thinks «X» is discrete or continuous. CDF(x) does the same, generating a cumulative mass or cumulative probability function. If «X» contains text values it knows «X» must be discrete. If «X» contains numbers with few or no identical values, it guesses continuous. If «X» contains numbers with many identical values, it guesses discrete.
Usually, they guess correctly. But sometimes, such as with discrete distributions over a wide range of integers, the choice may be ambiguous. In such cases, there are two ways to make sure they do what you want:
- If «X» is a variable, you can set its Domain attribute to Continuous, or to Discrete Numeric, Categorical, List of Numbers, List of Labels, or Index, all of which it treats as discrete. Automatic is the default, meaning Analytica guesses.
- If «X» is an expression, set the optional parameter «Discrete» of PDF or CDF to True or False.
If «X» contains text values, i.e. categorical data, you may want to control the order of the categories, e.g. ["Low", "Medium", "High"]. You can do this by specifying its Domain as a List of Labels with these values, or as an Index, referring to an Index that contains them. Alternatively, you can provide a list of labels to the optional «Domain» parameter of PDF or CDF. If «X» is an expression rather than a variable, this is your only choice.
Weighted Data
Normally, each point in a data set or sample carries equal weight. However, in some situations data or sample points may have unequal weights. When the running index is Run (i.e., the case of variables with uncertainty), the SampleWeighting system variable provides the default weighting (which itself defaults to equally weighted points). You can override the default by passing a weighting explicitly in the «w» parameter, for example:
PDF(Total_revenue, w: (SalesByRegion < ProjectedSales)[Region = 'East Coast'])
This expression computes the posterior probability of total revenue given that the east coast sales are less than projected, which is accomplished by providing a zero weight for all points not consistent with the assumption.
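The effect of zero weights can be illustrated in Python with a small discrete example. The data values and the names `weighted_pmf`, `revenue`, `east_sales`, and `projected` are made up for illustration. The sketch only shows how points with weight zero drop out of the resulting distribution.

```python
def weighted_pmf(sample, weights):
    """Probability mass of each value, with each point counted by its weight."""
    total = sum(weights)
    pmf = {}
    for value, w in zip(sample, weights):
        pmf[value] = pmf.get(value, 0.0) + w / total
    return pmf

# Hypothetical sample of six runs.
revenue = [10, 20, 20, 30, 30, 30]
east_sales = [5, 7, 3, 9, 2, 4]
projected = 6

# Weight 1 where the condition holds, 0 where it does not.
weights = [1.0 if s < projected else 0.0 for s in east_sales]
posterior = weighted_pmf(revenue, weights)
print(posterior)
```

Only the four runs that satisfy the condition contribute, so the result is the distribution of revenue conditional on that condition.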
More details
PDF and CDF behave differently depending on whether the domain of «x» is discrete or continuous. They determine which applies as follows. If «x» contains non-numeric values, the domain is discrete. Otherwise, if the optional discrete parameter is specified, its value (true = discrete, false = continuous) is used. Otherwise, if the domain parameter contains a variable identifier, the Domain attribute of that variable is consulted. (You would seldom, if ever, specify the domain parameter explicitly, but when the first parameter to PDF/CDF is a variable identifier, the domain parameter picks it up by default.) If the Domain attribute is set to Continuous, a continuous domain is used. If it is set to Discrete (numeric or categorical), to an explicit list of numbers or list of labels, or to an Index, a discrete domain is used. Otherwise (i.e., the Domain attribute is Automatic, or the domain parameter is not a variable identifier), PDF uses heuristics to guess whether «x» is discrete numeric or continuous. The heuristics judge such things as whether the values in «x» appear to be regular integer multiples (as would occur with a discrete distribution such as Poisson or Binomial).
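The order of precedence described above can be sketched as a small Python function. All names here are hypothetical. The final heuristic, which counts repeated values, is a placeholder assumption, since this page does not specify the actual heuristics.

```python
DISCRETE_DOMAINS = {
    "Discrete Numeric", "Categorical",
    "List of Numbers", "List of Labels", "Index",
}

def is_discrete(sample, discrete=None, domain_attr="Automatic"):
    """Decide discrete vs. continuous, following the precedence in the text."""
    # 1. Non-numeric (text) values force a discrete domain.
    if any(isinstance(v, str) for v in sample):
        return True
    # 2. An explicit discrete parameter comes next.
    if discrete is not None:
        return bool(discrete)
    # 3. Then the Domain attribute of the variable, if one is set.
    if domain_attr == "Continuous":
        return False
    if domain_attr in DISCRETE_DOMAINS:
        return True
    # 4. Otherwise guess. Placeholder heuristic (an assumption, not
    #    Analytica's rule): discrete when at least half the points repeat.
    return len(set(sample)) <= len(sample) / 2

print(is_discrete(["Low", "High"], discrete=False))          # text wins
print(is_discrete([1, 1, 2, 2], domain_attr="Continuous"))   # attribute wins
print(is_discrete([0.1, 0.2, 0.3, 0.4]))                     # guessed
```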
When PDF/CDF uses a discrete domain, the domain parameter contains a variable identifier, and the Domain attribute of that variable contains an explicit list of values or an index with explicit values, those values are used, in that order, as the domain of PDF/CDF. If no such domain declaration is available, the set of unique values in «x» is used as the domain. If a variable with an explicit domain was found, that variable serves as the index of possible values. If no such domain variable was found, a "magic" local index named PossibleValues is used. The result is indexed either by this domain index or by the local PossibleValues index. The value in each cell of the array is the relative frequency of occurrence of that value.
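As an illustration of the discrete case, the following Python sketch computes relative frequencies over either an explicit ordered domain or the unique values in the sample. The function names and the `domain` argument are hypothetical. Using first-appearance order when no domain is given is an assumption, since the page does not state how unique values are ordered.

```python
from collections import Counter

def discrete_pmf(sample, domain=None):
    """Relative frequency of each possible value.

    `domain` fixes the order of possible values. When omitted, the unique
    values are taken in order of first appearance (an assumption).
    """
    counts = Counter(sample)
    if domain is None:
        domain = list(dict.fromkeys(sample))
    n = len(sample)
    return {value: counts.get(value, 0) / n for value in domain}

def discrete_cdf(pmf):
    """Running total of a PMF, in the PMF's own ordering."""
    total = 0.0
    result = {}
    for value, p in pmf.items():
        total += p
        result[value] = total
    return result

sample = ["Low", "High", "Medium", "Low", "Low", "High", "Medium", "Low"]
pmf = discrete_pmf(sample, domain=["Low", "Medium", "High"])
print(pmf)
print(discrete_cdf(pmf))
```

Supplying the domain explicitly matters for the cumulative version, because the running total depends on the order of the categories.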
When PDF uses a continuous domain, the result is indexed by Step and DensityIndex (plus any abstracted indexes in the parameters). Step is a "magic" local index with the name "Step". DensityIndex is a system variable index containing two elements, ["X", "Y"]. The "X" column of the result contains the centroid for each "bin" of the histogram, while the "Y" column contains the density estimate for that bin.
To construct a continuous PDF/CDF, the algorithm must partition the set of reals into bins. The key operation is determining where to place these bins (or, more accurately, the boundaries between these bins). There are three algorithms that may be employed for doing this:
- EqualX (spacingMethod = 0): Divides the range of values occurring in «X» into equal-sized intervals.
- Equal Weighted Prob (spacingMethod = 1): Selects bins with variable sizes so that each bin contains the same amount of weighted probability mass.
- Equal Sample Prob (spacingMethod = 2): Selects bins with variable sizes so that roughly the same number of points fall into each bin.
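The difference between equal-width and equal-count binning can be sketched in Python as follows. This is an illustration under stated assumptions, not Analytica's algorithm. The function names are hypothetical, the number of bins is derived from a samples-per-step value by integer division, and boundaries between equal-count bins are placed midway between neighboring points.

```python
def bins_from_samples_per_step(n_points, samples_per_step):
    """Assumed rule: about `samples_per_step` points per bin on average."""
    return max(1, n_points // samples_per_step)

def equal_x_edges(sample, n_bins):
    """Bin boundaries that split the observed range into equal widths."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins)] + [hi]

def equal_sample_p_edges(sample, n_bins):
    """Bin boundaries chosen so each bin holds roughly the same point count."""
    xs = sorted(sample)
    n = len(xs)
    edges = [xs[0]]
    for k in range(1, n_bins):
        cut = k * n // n_bins
        # Boundary midway between the last point of one bin and the
        # first point of the next.
        edges.append((xs[cut - 1] + xs[cut]) / 2)
    edges.append(xs[-1])
    return edges

sample = [1, 2, 3, 4, 10, 20, 30, 100]
n_bins = bins_from_samples_per_step(len(sample), samples_per_step=2)
print(equal_x_edges(sample, n_bins))
print(equal_sample_p_edges(sample, n_bins))
```

With skewed data like this, equal-width bins leave most points in the first bin, while equal-count bins become narrow where the data is dense and wide where it is sparse.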
With a constant weighting, Equal Weighted P and Equal Sample P should be identical, up to numeric round-off effects. The «samplesPerStep» parameter controls how finely partitioned the histogram is, specifying how many points, on average, should land in each bin. These controls can be supplied explicitly via optional parameters to PDF or CDF. If they aren't specified, PDF obtains them from the settings in the Uncertainty Setup dialog. If «x» (i.e., domain) is a variable identifier, the local settings for that variable are used. Otherwise, the local settings for the variable whose definition contains the call to PDF are used. If no local settings exist, the global settings are used.
Analytica defaults to an EqualX «spacingMethod» for PDF, and an EqualP method for CDFs.
Once the bins are selected, the density estimate for each bin is the proportion of points in the bin divided by the bin's width.
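That calculation can be written out as a short Python sketch. The function name `histogram_density` is hypothetical. The sketch reports the midpoint of each bin as its "X" value, which is a simplifying assumption where the text above refers to a centroid, and it assumes unweighted points that all lie within the given bin edges.

```python
import bisect

def histogram_density(sample, edges):
    """Pairs (bin midpoint, density) given sorted bin boundaries."""
    n = len(sample)
    counts = [0] * (len(edges) - 1)
    for x in sample:
        # Locate the bin containing x; a point equal to the top edge
        # is counted in the last bin.
        i = min(bisect.bisect_right(edges, x) - 1, len(counts) - 1)
        counts[i] += 1
    result = []
    for i, c in enumerate(counts):
        width = edges[i + 1] - edges[i]
        midpoint = (edges[i] + edges[i + 1]) / 2
        result.append((midpoint, (c / n) / width))
    return result

sample = [0.5, 1.5, 1.6, 2.5, 3.5, 3.9, 3.2, 3.8]
edges = [0, 1, 2, 3, 4]
density = histogram_density(sample, edges)
for x, y in density:
    print(x, y)
```

Dividing by the width is what makes the result a density: the sum of density times width over all bins is 1, whatever the bin sizes.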
History
Introduced in Analytica 4.0