# Cdf and Pdf Functions

## PDF(X) and CDF(X)

**PDF** generates a histogram or *probability density function* for «X», where «X» is a sample of data. **CDF** generates a *cumulative distribution function* for «X». They are similar to the methods used to generate the PDF and CDF uncertainty views for uncertain quantities. But, as functions, they return their results as arrays that are available for further processing, display, or export. They can also work with data indexed by something other than Run, the default index for uncertain samples. Depending on whether «X» is a sample from a discrete or a continuous distribution, **PDF** returns a probability mass function (histogram) or a density function. Similarly, **CDF** can generate a cumulative mass function or a cumulative distribution function.

**PDF** and **CDF** have one required parameter, «X», which holds the sample data points, indexed by `I`. The functions also accept several optional parameters, described below, with the following syntax:

`PDF(x: [I]; I: IndexType=Run; w: NonNegative[I] = SampleWeighting; discrete: optional boolean; binMethod, samplesPerStep: optional positive; domain: Unevaluated = x)`

`CDF(x: [I]; I: IndexType=Run; w: NonNegative [I] = SampleWeighting; discrete: Optional Boolean; binMethod, samplesPerStep: Optional Positive; domain: Unevaluated = x)`

### Examples

A common use is to generate the **PDF** or **CDF** table of an uncertain variable «X», generated as a random sample, e.g.:

`PDF(X)`

`CDF(X)`

Here the distribution, «X», should be uncertain -- i.e. a sample indexed by Run, usually generated from a probability distribution.

### Continuous or discrete?

Usually, **PDF** and **CDF** figure out automatically whether «X» is discrete or continuous. They assume «X» is discrete if it contains text values or numerical values with many repetitions, and continuous if it contains only numbers with few or no repetitions. You can override that assumption by specifying the optional parameter `discrete: True` or `discrete: False`.

If the distribution is continuous, the result is indexed by Step and by DensityIndex, which has the elements 'X' and 'Y', where 'Y' contains the probability density (or the cumulative probability for **CDF**). If it is discrete, the result contains the probability mass (or the cumulative probability for **CDF**), indexed by PossibleValues.

### Histograms

You can also use **PDF** and **CDF** to generate histograms of data that is not uncertain, i.e. indexed by something other than Run. For example, to generate a histogram of `Y` over index `J`, use:

`PDF(Y, J)`

## Optional parameters

### I

The index over which the functions generate the histogram. By default this is Run -- i.e. a Monte Carlo sample -- but you can also specify another index to generate a histogram over another dimension.

### W

The sample weights, which you can use to weight each sample point differently. Defaults to the system variable **SampleWeighting**.

### Discrete

Set to `True` or `False` to force discrete or continuous treatment.

### SpacingMethod

Selects the histogramming method used. If omitted, the function uses the system default set in the **Uncertainty Setup** dialog from the **Result** menu. Options are:

- `0` = "equal-X": Equal steps along the «X» axis (values of «X»).
- `1` = "equal-sample-P": Equal numbers of sample values in each step.
- `2` = "equal-weighted-P": Equal sum of weights of samples, weighted by «w».

### SamplesPerStep

An integer specifying the number of samples per bin. If omitted, the function uses the system default set in the **Uncertainty Setup** dialog from the **Result** menu.

### Domain

Name of a variable whose Domain attribute should be used (see below).

### SmoothingMethod

- `0` = "Histogram": Shows **PDF** as a histogram.
- `1` = "KDE": Uses Kernel Density Estimation to generate a smooth curve, with «smoothingFactor» (below) ranging from `-1` to `1` (default `0`).
- `2` = "KDE": Uses Kernel Density Estimation to generate a smooth curve, with «smoothingFactor» (below) treated as a global bandwidth in the same units as «x».

### SmoothingFactor

If «smoothingMethod» is KDE, this factor specifies the degree of smoothness, from `-1` (maximal detail) to `+1` (maximal smoothing). Usually the best value is `0`, which is the default.
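To give a feel for what a bandwidth does in kernel density estimation, here is a minimal Python sketch of a fixed-bandwidth estimate, loosely corresponding to «smoothingMethod» `2`, where the factor is a global bandwidth in the units of «x». The Gaussian kernel, the function name `gaussian_kde`, and the sample values are assumptions made for illustration; the kernel Analytica actually uses, and the way it maps the `-1` to `+1` range onto a bandwidth, are not stated here.

```python
import math

def gaussian_kde(sample, bandwidth, x):
    """Density estimate at x: the average of Gaussian bumps centred on each sample point.

    Assumption: a Gaussian kernel with one global bandwidth (illustrative only).
    """
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

# Hypothetical data: two points close together, two further out.
sample = [1.0, 1.2, 2.5, 3.1]
grid = (1.0, 2.0, 3.0)

detailed = [gaussian_kde(sample, 0.2, x) for x in grid]  # small bandwidth: more detail
smooth = [gaussian_kde(sample, 1.0, x) for x in grid]    # large bandwidth: more smoothing
```

With the small bandwidth the estimate peaks sharply near the clustered points and drops almost to zero in the gap around 2.0. The larger bandwidth spreads the same probability mass into a flatter curve.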

### Exceedance

*New to Analytica 5.4*

When «exceedance» is set to true in a call to **CDF**, the function returns the *exceedance curve* instead of the *cumulative probability*. The exceedance curve is one minus the CDF curve (i.e., the CDF curve flipped vertically) and gives the probability that the true outcome exceeds the given level. Some fields of study prefer exceedance curves to CDF curves.
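The relationship between the two curves can be sketched in a few lines of Python. This illustrates the arithmetic only and is not Analytica code; the function name `exceedance_curve` and the sample points are hypothetical.

```python
def exceedance_curve(cdf_points):
    """Turn (x, cumulative probability) pairs into (x, exceedance probability) pairs."""
    return [(x, 1.0 - p) for x, p in cdf_points]

# Hypothetical CDF: P(outcome <= 10) = 0.25, P(outcome <= 20) = 0.5, P(outcome <= 30) = 1.0
cdf_points = [(10, 0.25), (20, 0.5), (30, 1.0)]
exceedance = exceedance_curve(cdf_points)
```

Here `exceedance` pairs each level with the probability of exceeding it, so the curve falls from 0.75 at 10 to 0.0 at 30 while the CDF rises.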

### Is the distribution discrete or continuous?

**PDF**(X) generates a probability mass function or density function according to whether it thinks «X» is discrete or continuous. **CDF**(x) does the same, generating a cumulative mass or cumulative probability function. If «X» contains text values it knows «X» must be discrete. If «X» contains numbers with few or no identical values, it guesses continuous. If «X» contains numbers with many identical values, it guesses discrete.

Usually, they guess correctly. But, sometimes, such as with discrete distributions over a wide range of integers, it may be ambiguous. In such cases, there are two ways to make sure it does what you want:

- If «X» is a variable, you can set its Domain attribute to `Continuous`, or to `Discrete Numeric`, `Categorical`, `List of Numbers`, `List of Labels`, or `Index` -- all of which it treats as discrete. `Automatic` is the default, meaning Analytica guesses.
- If «X» is an expression, set the optional parameter «Discrete» of **PDF** or **CDF** to `True` or `False`.

If «X» contains text values, i.e. categorical data, you may want to control the order of the categories, e.g. `["Low", "Medium", "High"]`. You can do this by specifying its Domain as a List of Labels with these values, or as an **Index** that refers to an Index containing them. Alternatively, you can provide a list of labels to the optional «Domain» parameter of **PDF** or **CDF**. If «X» is an expression rather than a variable, this is your only choice.

### Weighted Data

Normally, each point in a data set or sample carries equal weight. However, in some situations data or sample points may have unequal weights. When the running index is Run (i.e., the case of variables with uncertainty), the SampleWeighting system variable provides the default weighting (which itself defaults to equally weighted points). You can override the default weighting explicitly using the «w» parameter, for example:

`PDF(Total_revenue, w: (SalesByRegion < ProjectedSales)[Region = 'East Coast'])`

This expression computes the posterior probability of total revenue given that the east coast sales are less than projected, which is accomplished by providing a zero weight for all points not consistent with the assumption.
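The effect of zero weights can be sketched in Python. This is an illustration of the idea, not Analytica's implementation: the function name `weighted_mass` and all the numbers are hypothetical, and the revenue values are kept discrete so that the arithmetic is easy to follow.

```python
def weighted_mass(values, weights):
    """Relative weighted frequency of each distinct value."""
    total = sum(weights)
    mass = {}
    for v, w in zip(values, weights):
        mass[v] = mass.get(v, 0.0) + w
    return {v: m / total for v, m in mass.items()}

# Hypothetical sample of six runs.
total_revenue = [100, 120, 100, 140, 120, 100]
east_sales = [8, 12, 9, 15, 7, 11]
projected = 10

# Weight 1 for runs consistent with the condition, 0 for the rest.
weights = [1.0 if s < projected else 0.0 for s in east_sales]
posterior = weighted_mass(total_revenue, weights)
```

Only the three runs with east coast sales below the projection contribute, so the value 140, which occurs only in an excluded run, ends up with zero probability.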

## More details

**PDF** and **CDF** behave differently depending on whether the domain of «x» is discrete or continuous. They determine which applies as follows. If «x» contains non-numeric values, the domain is discrete. Otherwise, if the optional «discrete» parameter is specified, its value is used (`true` = discrete, `false` = continuous). Otherwise, if the «domain» parameter contains a variable identifier, the Domain attribute of that variable is consulted. (You would seldom, if ever, specify the «domain» parameter explicitly, but if the first parameter to **PDF**/**CDF** is a variable identifier, the «domain» parameter picks it up.) If the Domain attribute is set to `Continuous`, a continuous domain is used. If it is set to `Discrete` (numeric or categorical), to an explicit list or list of labels, or to an Index, a discrete domain is used. Otherwise (i.e., the Domain attribute is "automatic", or the «domain» parameter is not a variable identifier), **PDF** uses heuristics to "guess" whether «x» is discrete numeric or continuous. The heuristics judge such things as whether the values in «x» appear to be regular integer multiples (as would occur with a discrete distribution such as Poisson or Binomial).
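The order of precedence described above can be sketched as a small Python function. Everything here is hypothetical: the function name `is_discrete`, the string stand-ins for Domain attribute settings, and in particular the final repetition-count heuristic, which is a placeholder assumption and not the heuristic Analytica actually applies.

```python
# Assumed string stand-ins for Domain attribute settings that imply a discrete domain.
DISCRETE_DOMAINS = {"Discrete Numeric", "Categorical", "List of Numbers",
                    "List of Labels", "Index"}

def is_discrete(values, discrete=None, domain_attr=None):
    """Decide discrete vs. continuous treatment in the documented order of precedence."""
    # 1. Non-numeric values always mean a discrete domain.
    if any(not isinstance(v, (int, float)) for v in values):
        return True
    # 2. An explicit discrete parameter comes next.
    if discrete is not None:
        return bool(discrete)
    # 3. Then the Domain attribute of the variable, if one is available.
    if domain_attr == "Continuous":
        return False
    if domain_attr in DISCRETE_DOMAINS:
        return True
    # 4. Otherwise guess. Placeholder assumption: call the data discrete when
    #    more than half of the points repeat an earlier value.
    repeats = len(values) - len(set(values))
    return repeats > len(values) / 2
```

For example, `is_discrete(["Low", "High"], discrete=False)` still returns `True`, because the check for non-numeric values comes before the «discrete» parameter.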

When **PDF**/**CDF** uses a discrete domain, the «domain» parameter contains a variable identifier, and the Domain attribute of that variable contains an explicit list of values or an index with explicit values, then those values are used, in that order, as the domain of **PDF**/**CDF**. If no such domain declaration is available, the set of unique values in «x» is used as the domain. If a variable with an explicit domain was found, that variable serves as the index of possible values. If no such domain variable was used, a "magic" local index named `PossibleValues` is used. The result is indexed either by this domain index or by the local `PossibleValues` index. The value in each cell of the array is the relative frequency of occurrence of that value.
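A minimal Python sketch of the discrete case follows. It is not Analytica code: the function name `probability_mass` is hypothetical, and ordering the unique values by first appearance when no domain is given is an assumption, since the ordering is not specified here.

```python
def probability_mass(values, domain=None):
    """Relative frequency of each possible value, in domain order."""
    if domain is None:
        # Assumption: with no explicit domain, use the unique values
        # in order of first appearance.
        domain = list(dict.fromkeys(values))
    n = len(values)
    return [(d, sum(1 for v in values if v == d) / n) for d in domain]

sample = ["Low", "High", "Low", "Medium"]

ordered = probability_mass(sample, domain=["Low", "Medium", "High"])
unordered = probability_mass(sample)
```

Supplying the explicit domain fixes the category order as Low, Medium, High. Without it, the order simply follows the data.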

When **PDF** uses a continuous domain, the result is indexed by `Step` and `DensityIndex` (plus any abstracted indexes in the parameters). `Step` is a "magic" local index with the name "Step". `DensityIndex` is a system variable index containing two elements, `["X", "Y"]`. The "X" column of the result contains the centroid for each "bin" of the histogram, while the "Y" column contains the density estimate for that bin.

To construct a continuous **PDF**/**CDF**, the algorithm must partition the set of reals into bins. The key operation is determining where to place these bins (or, more accurately, the boundaries between these bins). There are three algorithms that may be employed for doing this:

- *EqualX* (`spacingMethod = 0`): Divides the range of values occurring in «X» into equal-sized intervals.
- *Equal Weighted Prob* (`spacingMethod = 1`): Selects bins with variable sizes so that each bin contains the same amount of weighted probability mass.
- *Equal Sample Prob* (`spacingMethod = 2`): Selects bins with variable sizes so that roughly the same number of points fall into each bin.

With a constant weighting, *Equal Weighted P* and *Equal Sample P* should be identical, up to numeric round-off effects. The «samplesPerStep» parameter controls how finely the histogram is partitioned by specifying how many points, on average, should land in each bin. These controls can be supplied explicitly via optional parameters to **PDF** or **CDF**. If they aren't specified, **PDF** obtains them from the settings in the **Uncertainty Setup** dialog: if «x» (i.e., «domain») is a variable identifier, the local settings for that variable are used; otherwise, the local settings for the variable whose definition contains the call to **PDF** are used. If those are not set, the global settings are used.
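The difference between equal-width and equal-count binning can be sketched in Python for an unweighted sample. This is a simplified illustration under stated assumptions, not Analytica's algorithm: the function names are hypothetical, the number of equal-width bins is taken to be the sample size divided by the samples per step, and equal-count boundaries are placed midway between neighbouring sorted points.

```python
def equal_x_edges(sample, samples_per_step):
    """Bin boundaries that split the range of the sample into equal-width intervals."""
    n_bins = max(1, len(sample) // samples_per_step)
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins)] + [hi]

def equal_sample_p_edges(sample, samples_per_step):
    """Bin boundaries chosen so that each bin holds about samples_per_step points."""
    ordered = sorted(sample)
    cuts = range(samples_per_step, len(ordered), samples_per_step)
    # Assumption: place each inner boundary midway between neighbouring points.
    inner = [(ordered[i - 1] + ordered[i]) / 2 for i in cuts]
    return [ordered[0]] + inner + [ordered[-1]]

# Hypothetical sample with one outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 100]

x_edges = equal_x_edges(sample, 2)         # [1.0, 25.75, 50.5, 75.25, 100]
p_edges = equal_sample_p_edges(sample, 2)  # [1, 2.5, 4.5, 6.5, 100]
```

With the outlier at 100, the equal-width bins put seven of the eight points into the first bin, whereas the equal-count bins keep two points in every bin at the cost of one very wide final bin.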

Analytica defaults to an *EqualX* «spacingMethod» for **PDF**, and an *EqualP* method for **CDF**s.

Once the bins are selected, the density estimate for each bin is the proportion of points in the bin divided by the bin's width.
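As a worked illustration of that calculation, here is a minimal Python sketch for an unweighted sample. The function name `histogram_density` and the data are hypothetical, and the "X" value is taken as the bin midpoint for simplicity rather than the centroid.

```python
def histogram_density(sample, edges):
    """Return (bin midpoint, density) pairs, with density = proportion / bin width."""
    n = len(sample)
    n_bins = len(edges) - 1
    result = []
    for i in range(n_bins):
        left, right = edges[i], edges[i + 1]
        # Bins are half-open [left, right), except the last, which includes its right edge.
        count = sum(1 for v in sample
                    if left <= v < right or (i == n_bins - 1 and v == right))
        result.append(((left + right) / 2, count / n / (right - left)))
    return result

sample = [0.5, 1.0, 1.5, 3.0]
edges = [0, 2, 4]
density = histogram_density(sample, edges)  # [(1.0, 0.375), (3.0, 0.125)]
```

Three of the four points fall in the first bin of width 2, giving 0.75 / 2 = 0.375, and the remaining point gives 0.25 / 2 = 0.125. Multiplying each density by its bin width and summing recovers a total probability of 1.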

## History

Introduced in Analytica 4.0
