Cdf and Pdf Functions

Revision as of 20:02, 31 January 2007 by Lchrisman (talk | contribs)

Computes the cumulative probability distribution (CDF) or the probability function (PDF) from a sampled distribution by histogramming the sample. For continuous distributions, these functions estimate the cumulative distribution and the probability density. For discrete distributions, they estimate the cumulative probability and the probability of each value.

These can be applied to a weighted sample.

Declarations

The complete declarations for these functions are:

PDF(x : ContextSamp[I]; I : IndexType=Run ; w:NonNegative ContextSamp[I]=SampleWeighting ; 
    discrete : optional boolean ; method, samplesPerStep : optional positive ; domain : Unevaluated = x)
CDF(x : ContextSamp[I]; I : IndexType=Run ; w:NonNegative ContextSamp[I]=SampleWeighting ; 
    discrete : optional boolean ; method, samplesPerStep : optional positive ; domain : Unevaluated = x)

where:

  • x : The sample data points
  • I : The running index (Run by default)
  • w : The sample weights. Can be used to weight each sample point differently.
  • discrete : Forces a discrete (true) or continuous (false) treatment of the data.
  • method : Selects the histogramming method used (0 = equal-X, 1 = equal-weighted-P, 2 = equal-sample-P).
  • samplesPerStep : Controls bin size by setting how many sample points, on average, should land in each bin.
  • domain : The variable (if any) containing the domain information.

Simple Usage

The simplest and most typical usage returns the PDF or CDF table that you would see in a result view when viewing the result in PDF or CDF mode. To get this result, the functions are called with a single parameter, e.g.:

PDF(Ch1)
CDF(Ch1)

Here the distribution, Ch1, contains uncertainty and therefore has a sample indexed by Run.

Histogramming Data

PDF and CDF can be applied to arrays of data indexed by something other than Run to obtain a histogram of that data. For example, to histogram a quantity X along index J, use:

PDF( X, J )
CDF( X, J )

Controlling Interpretation

When any of the above examples are evaluated, Analytica must infer whether the data in X is discrete or continuous. If the data is discrete, it must also infer an ordering on the data. To help ensure that Analytica infers these as you intend, it is best to set the domain attribute of X, where these things can be specified. For example, if X is a discrete numeric quantity (such as a distribution on the integers), you would set the domain of X to "Discrete Numeric". If X contains non-numeric values, you can control their ordering by setting the domain of X to a list of labels (or an index-domain), so that the domain specifies the ordering.

Specifying domain information in the domain attribute is usually the preferred way to provide this information to the functions. However, the same information can also be specified using the optional parameters, which override any domain value. If an expression (rather than a single variable) is used as the first parameter, Analytica cannot consult a domain attribute, so this information must be controlled through the optional parameters Discrete and Domain. Provide a boolean to Discrete to control whether numeric data is interpreted as discrete or continuous. To control the domain ordering for categorical data, pass to the Domain parameter a variable whose domain attribute contains the desired ordering.

Weighted Data

Normally, each point in a data set or sample carries equal weight. However, in some situations data or sample points may have unequal weights. When the running index is Run (i.e., the case of variables with uncertainty), the SampleWeighting system variable provides the default weighting (which itself defaults to equally weighted points). The default weighting can be overridden by supplying weights explicitly through the w parameter, for example:

PDF( Total_revenue, w: (SalesByRegion<ProjectedSales)[Region='East Coast'] )

This expression computes the posterior probability of total revenue given that the east coast sales are less than projected, which is accomplished by providing a zero weight for all points not consistent with the assumption.
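
The zero-weight idea can be illustrated outside Analytica. The following Python sketch is not Analytica code; the sample values and the helper name weighted_probability are hypothetical and chosen only for illustration. It shows how assigning weight 0 to sample points that violate a condition turns a weighted frequency into a conditional (posterior) estimate.

```python
def weighted_probability(values, weights, predicate):
    """Weighted fraction of sample points whose value satisfies `predicate`."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    hit = sum(w for v, w in zip(values, weights) if predicate(v))
    return hit / total


# Hypothetical Monte Carlo sample: one entry per sample point (Run).
total_revenue = [120.0, 95.0, 140.0, 80.0, 110.0]
east_sales = [30.0, 22.0, 41.0, 18.0, 35.0]
east_projected = 32.0

# Weight 1 where east coast sales fall short of projection, 0 elsewhere.
cond_weights = [1.0 if s < east_projected else 0.0 for s in east_sales]
equal_weights = [1.0] * len(total_revenue)

# Probability that revenue exceeds 100, before and after conditioning.
p_prior = weighted_probability(total_revenue, equal_weights, lambda r: r > 100)
p_post = weighted_probability(total_revenue, cond_weights, lambda r: r > 100)
```

Only the three points consistent with the condition contribute to p_post, which is the same effect the w parameter has on the histogram built by PDF or CDF.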


Detailed Description

PDF and CDF behave differently depending on whether the domain of x is discrete or continuous. They determine which case applies as follows. If x contains non-numeric values, the domain is discrete. Otherwise, if the optional discrete parameter is specified, its value (true=discrete, false=continuous) is used. Otherwise, if the domain parameter contains a variable identifier, the domain attribute for that variable is consulted. (The user of PDF/CDF would seldom, if ever, explicitly specify the domain parameter, but if the first parameter to PDF/CDF is a variable identifier, the domain parameter picks that up by default.) If the domain attribute is set to Continuous, a continuous domain is used. If it is set to Discrete (numeric or categorical), if it is an explicit list or list of labels, or if it is set to an Index, a discrete domain is used. Otherwise (i.e., the domain attribute is "automatic", or the domain parameter is not a variable identifier), PDF uses heuristics to "guess" whether x is discrete numeric or continuous. The heuristics judge such things as whether the values in x appear to be regular integer multiples (as would occur with a discrete distribution such as Poisson or Binomial).
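
The order of precedence described above can be summarized in a short Python sketch. This is not Analytica's implementation: the function name resolve_is_discrete, the string encodings of the domain attribute, and in particular the fallback check looks_integer_valued are assumptions made for illustration. The actual heuristics are not documented here, so the fallback is only a stand-in.

```python
def looks_integer_valued(x):
    """Stand-in heuristic (an assumption): treat all-integer data as discrete."""
    return all(float(v).is_integer() for v in x)


def resolve_is_discrete(x, discrete=None, domain_attr=None):
    """Return True for a discrete treatment, False for a continuous one.

    `domain_attr` is a hypothetical string encoding of the domain attribute:
    "continuous", "discrete", "list", "index", "automatic", or None when the
    first parameter is an expression rather than a variable.
    """
    # 1. Non-numeric data is always discrete.
    if any(not isinstance(v, (int, float)) for v in x):
        return True
    # 2. An explicit discrete parameter comes next.
    if discrete is not None:
        return bool(discrete)
    # 3. Then the domain attribute of the domain variable, if it is decisive.
    if domain_attr == "continuous":
        return False
    if domain_attr in ("discrete", "list", "index"):
        return True
    # 4. Otherwise fall back on a guess from the data.
    return looks_integer_valued(x)
```

For example, resolve_is_discrete([1, 2, 3], domain_attr="continuous") yields a continuous treatment even though the data look like integers, because the domain attribute is consulted before any guessing.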

When PDF/CDF uses a discrete domain, the domain parameter contains a variable identifier, and the domain attribute of that variable contains an explicit list of values (or an index with explicit values), then those values are used, in that order, as the domain of PDF/CDF. If no such domain declaration is available, the set of unique values in x is used as the domain. If a variable with an explicit domain was found, that variable serves as the index of possible values. If no such domain variable was used, a "magic" local index named "PossibleValues" is used. The result is indexed either by the domain index or by the local "PossibleValues" index. The value in each cell of the array is the relative frequency of occurrence of that value.
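
As a rough illustration of the discrete case, the Python sketch below computes weighted relative frequencies over either an explicit domain or the unique values found in the data. The function names discrete_pdf and discrete_cdf are hypothetical, and sorting the unique values when no domain is given is an assumption, since the ordering used in that case is not stated above.

```python
def discrete_pdf(x, w=None, domain=None):
    """Relative (weighted) frequency of each possible value.

    Returns a list of (value, probability) pairs in domain order.
    """
    if w is None:
        w = [1.0] * len(x)  # equally weighted points by default
    if len(w) != len(x):
        raise ValueError("x and w must have the same length")
    if domain is None:
        domain = sorted(set(x))  # assumption: sorted unique values
    total = sum(w)
    mass = {v: 0.0 for v in domain}
    for v, wi in zip(x, w):
        if v in mass:  # values outside an explicit domain are skipped
            mass[v] += wi
    return [(v, mass[v] / total) for v in domain]


def discrete_cdf(x, w=None, domain=None):
    """Running sum of discrete_pdf, following the domain ordering."""
    rows, running = [], 0.0
    for v, p in discrete_pdf(x, w, domain):
        running += p
        rows.append((v, running))
    return rows
```

Calling discrete_pdf(["b", "a", "b", "c"], domain=["c", "b", "a"]) returns the frequencies in the order given by the explicit domain, which mirrors how a list-of-labels domain controls the ordering of categorical results.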

When PDF uses a continuous domain, the result will be indexed by "Step" and "DensityIndex" (plus any abstracted indexes in the parameters). Step is a "magic" local index with the name "Step". DensityIndex is a system variable index containing two elements, ["X", "Y"]. The "X" column of the result contains the centroid for each "bin" of the histogram, while the "Y" column contains the density estimate for that bin.

To construct a continuous PDF/CDF, the algorithm must partition the set of reals into bins. The key operation is determining where to place these bins (or, more accurately, the boundaries between them). Three algorithms may be employed for this: Equal X (method=0), Equal Weighted P (method=1) and Equal Sample P (method=2). The Equal X method divides the range of values occurring in x into equal-width intervals. The Equal Weighted P method selects bins of variable size so that each bin contains the same amount of weighted probability mass. The Equal Sample P method selects bins of variable size so that roughly the same number of points fall into each bin. (With a constant weighting, Equal Weighted P and Equal Sample P should be identical, up to numeric round-off effects.) The samplesPerStep parameter controls how finely partitioned the histogram is by specifying how many points, on average, should land in each bin.

These controls can be supplied explicitly via optional parameters to PDF or CDF. If they are not specified, PDF obtains them from the settings on the Uncertainty Settings dialog. If x (i.e., domain) is a variable identifier, the local settings for that variable are used; otherwise, the local settings for the variable whose definition contains the call to PDF are used. If no local settings are present, the global settings are used. Analytica defaults to an Equal X method for PDFs and an Equal P method for CDFs.
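
The difference between the binning methods can be sketched in Python. This is an illustrative approximation, not Analytica's algorithm: the function names are hypothetical, the bin count is assumed to be the sample size divided by samplesPerStep, and boundaries for the equal-probability methods are simply placed at sample values where the cumulative weight crosses each equal share of the total.

```python
def bins_from_samples_per_step(n_points, samples_per_step):
    """Assumed relation: about samples_per_step points per bin on average."""
    return max(1, n_points // samples_per_step)


def equal_x_edges(x, n_bins):
    """Equal X: split the observed range into equal-width intervals."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins)]
    edges.append(hi)  # close the last bin exactly at the maximum
    return edges


def equal_weighted_p_edges(x, w, n_bins):
    """Equal Weighted P: each bin holds about 1/n_bins of the total weight."""
    pairs = sorted(zip(x, w))
    total = sum(wi for _, wi in pairs)
    edges = [pairs[0][0]]
    cum, k = 0.0, 1
    for v, wi in pairs:
        cum += wi
        # Place a boundary each time another equal share of mass is reached.
        while k < n_bins and cum >= k * total / n_bins:
            edges.append(v)
            k += 1
    edges.append(pairs[-1][0])
    return edges


def equal_sample_p_edges(x, n_bins):
    """Equal Sample P: the same rule with every point counted once."""
    return equal_weighted_p_edges(x, [1.0] * len(x), n_bins)
```

With x = [1, 2, 3, 4] and weights [1, 1, 1, 3], the weighted method moves the interior boundary to 3 because half of the total weight sits on the last point, while the unweighted method places it at 2. Heavily weighted points can produce zero-width bins in this simple version, which a production implementation would need to handle.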

Once the bins are selected, the density estimate for each bin is the (weighted) proportion of points falling in the bin divided by the bin's width.
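
A minimal Python sketch of this last step follows. The function name histogram_density is hypothetical, and two details are assumptions: the bin midpoint is used for the "X" value, and a point lying exactly on an interior boundary is counted in the bin to its right.

```python
from bisect import bisect_right


def histogram_density(x, edges, w=None):
    """Return (x_value, density) pairs, one per bin defined by `edges`."""
    if w is None:
        w = [1.0] * len(x)  # equally weighted points by default
    n_bins = len(edges) - 1
    mass = [0.0] * n_bins
    for v, wi in zip(x, w):
        if v < edges[0] or v > edges[-1]:
            continue  # ignore points outside the binned range
        # Index of the bin containing v; the maximum goes in the last bin.
        i = min(bisect_right(edges, v) - 1, n_bins - 1)
        mass[i] += wi
    total = sum(w)
    rows = []
    for i in range(n_bins):
        width = edges[i + 1] - edges[i]
        if width <= 0:
            raise ValueError("bins must have positive width")
        midpoint = 0.5 * (edges[i] + edges[i + 1])
        rows.append((midpoint, mass[i] / total / width))
    return rows


rows = histogram_density([0.5, 1.5, 1.6, 3.0], [0.0, 1.0, 2.0, 4.0])
```

Here the last bin holds a quarter of the points but is twice as wide as the others, so its density is half that of the first bin. Multiplying each density by its bin width and summing gives 1 whenever every point falls inside the binned range.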
