Statistical Functions and Importance Weighting




A statistical function computes a quantity that summarizes some aspect of a sample of data. These are reduction functions, in that the running dimension is not present in the result; however, in some cases, other dimensions may be introduced. Many of these functions are used by the Analytica result views, and hence the detailed descriptions given here apply to these result views as well.

Statistical functions include: Mean, SDeviation, Variance, Skewness, Kurtosis, GetFract, Correlation, RankCorrel, Frequency, Probability, ProbBands, Statistics, PDF and CDF.

Concepts common to all statistical functions are covered first; then each function is described in more detail.

The Run Index

Analytica represents any uncertain quantity as a random sample from a probability distribution over the built-in index Run, which runs from 1 to SampleSize. By default, statistical functions force their main parameter «x» to be evaluated as a probability distribution, and they operate over Run. For example, Mean(X) returns the mean estimated from the random sample for X.

You can also apply any statistical function over another index by specifying it explicitly; for example, Mean(X, i) computes the mean value of array X over index i. If you specify an index other than Run, a statistical function evaluates its main parameter «x» as deterministic or probabilistic according to the context mode in which it is being evaluated (see the next section).
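
For instance, a minimal hypothetical sketch (the index Yr and variable Revenue are illustrative names):

Index Yr := [2020, 2021, 2022, 2023]
Variable Revenue := Table(Yr)(100, 120, 90, 150)
Mean(Revenue, Yr) { → 115, the average of Revenue over Yr }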

Evaluation Mode

When Analytica evaluates an expression, it always does so within an evaluation mode, either mid-mode or sample-mode. The result computed by an expression may differ between the two modes if it contains uncertainty.

If you view a result with Mid selected in the Analytica result window, the variable being viewed is evaluated in mid-mode. If you select any of the other result views -- Mean, Statistics, Bands, PDF, Prob Mass, CDF, Cum Mass, or Sample -- the variable is evaluated in sample-mode.

When most Analytica function calls are evaluated, the expressions provided as parameters are evaluated in the same evaluation mode as the call itself. This is referred to as context-mode, and it is the default. However, some functions alter the evaluation mode when their parameters are evaluated. For example, the function Sample(x) always evaluates «x» in sample-mode, even when the call itself is being evaluated in mid-mode. The Mid(x) function does the opposite -- its parameter is always evaluated in mid-mode, even if the call is evaluated in sample-mode.
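
To illustrate (X is a hypothetical chance variable):

Variable X := Normal(10, 2)
Mid(X) { evaluates X in mid-mode: returns the median, 10 }
Sample(X) { evaluates X in sample-mode: SampleSize values, indexed by Run }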

When the running index to a statistical function is Run (either because Run is explicitly specified, or because the running index parameter was not specified), the data parameter(s) are always evaluated in sample-mode. However, if a running index other than Run is specified, then the data parameter(s) are evaluated in context-mode. So, the evaluation mode used by statistical functions is actually conditional on the running index.

When a function's parameters are declared, the following parameter qualifiers control which evaluation mode is used when the parameter is evaluated:

Context

The default if no evaluation-mode qualifier is specified. The parameter is evaluated in the current evaluation mode. See Context. Example declaration:

MyFunction(x: Context)

Determ

The parameter is evaluated in mid-mode. See Determ. Example declaration:

MyFunction(x: Determ)

Sample

The parameter is evaluated in sample-mode. If the result is not indexed by Run, an error message is issued. Example usages:

MyFunction(x: Sample)
MyFunction(x: Sample Array[Run])

In the second declaration, the dimensions of «x» are declared. When the Sample qualifier is used with explicit dimensions, you will usually want «x» to have the Run dimension inside the function, so you need to include Run in the list of dimensions. If you don't include it, as in x: Sample Array[I] or x: Sample Array[], Analytica iterates over the Run dimension and supplies your function with one sample point at a time.
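
For example, a hypothetical user-defined function that needs the full sample inside its body must declare Run explicitly:

Function SampleRange(x: Sample Array[Run]) := Max(x, Run) - Min(x, Run)
{ With x: Sample Array[], the body would see one sample point at a time, and the result would always be 0. }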

Prob

The parameter is evaluated in sample-mode. Example usages:

MyFunction(x: Prob)
MyFunction(x: Prob Array[I, Run]; I: IndexType)
MyFunction(x: Prob Array All[Run])

When a Prob parameter is evaluated, no error is issued if the result doesn't contain the Run dimension. If dimensions are explicitly declared, as in the second and third examples, you must include the Run dimension in the list if you want «x» to contain the Run dimension when your function definition is evaluated. Since «x» might not contain the Run dimension when evaluated, the All qualifier ensures that it does when you rely on its presence -- the value of «x» will be constant over that dimension if the result of evaluating «x» did not include it.
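
For instance, a hypothetical function that relies on the Run dimension being present could be declared as:

Function FracPositive(x: Prob Array All[Run]) := Sum(x > 0, Run)/SampleSize
{ The All qualifier guarantees that x carries the Run dimension here, even when the actual parameter evaluates to a constant. }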

ContextSamp

A parameter declared with the ContextSamp qualifier is treated differently depending on whether or not its declaration includes the Run dimension. If the Run dimension is present, the parameter is evaluated in sample-mode; if not, it is evaluated in context-mode. In addition, a ContextSamp parameter is treated specially when the unevaluated expression in the function call is the special system variable SampleWeighting. In that case, the parameter is evaluated only if the Run dimension is declared for the parameter (yielding the value of SampleWeighting); otherwise, the parameter is not evaluated and a constant value of 1 is used. Example declaration:

MyStat(x: ContextSamp[I]; I: Index = Run)

In this example, if the function is called as MyStat(A) or MyStat(A, Run), the declaration of «x» is considered to have the Run dimension explicitly specified, so «A» is evaluated in sample-mode. However, when called as MyStat(A, J), «A» is evaluated in context -- either mid-mode or sample-mode depending on the current context. The usage MyStat(SampleWeighting, Run) would use the value of the system variable SampleWeighting for «x», while MyStat(SampleWeighting, J) would use 1 as the first parameter value.

The special behavior of the ContextSamp qualifier when passed SampleWeighting is used by functions that are capable of computing weighted statistics (which includes all of the built-in statistics described here).
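
As a minimal sketch of that pattern, a hypothetical user-defined weighted mean might be declared as follows (the built-in statistics follow a similar scheme):

Function MyWtMean(x: ContextSamp[I]; I: Index = Run; w: ContextSamp[I] = SampleWeighting) := Sum(x*w, I)/Sum(w, I)
{ Called as MyWtMean(A), w defaults to the value of SampleWeighting; called as MyWtMean(A, J), w defaults to the constant 1. }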

Importance Weighting

By default, statistical functions treat each sample point equally, i.e., with equal weighting. However, if desired, you can assign a different weight to each sample point, which the statistical functions will use. A non-uniform weighting results in a weighted statistic, such as a weighted mean or weighted variance.

Importance weighting can be used advantageously in several ways.

Example Uses of Importance Weighting

Posterior Conditioning

Using weights of 0 and 1 provides an elementary way to compute posterior probabilities. You may have several chance variables in your model, and an expression that determines whether the combination of values is possible or impossible (e.g., by comparing it to an observation that has been provided). The expression evaluates to 0 or 1 for each sample, 0 being impossible and 1 being possible. Using this expression as the weight (either via the «w» parameter or the global SampleWeighting, both described below), the statistical functions return statistics on the posterior. For example, if we let B denote this expression, Mean(x, w: B) computes E[x|B], PDF(x, w: B) shows p(x|B), and so on.
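
For instance, a minimal hypothetical sketch:

Chance Rain := Bernoulli(0.3)
Chance Sprinkler := Bernoulli(0.5)
Variable GrassIsWet := Rain Or Sprinkler
Variable B := (GrassIsWet = 1) { the observation: the grass is wet }
Mean(Rain, w: B) { estimates P(Rain | grass is wet) = 0.3/0.65 ≈ 0.46 }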

This form of posterior conditioning is useful when [math]\displaystyle{ P(B) }[/math], the probability that a sample from the prior distribution is "possible", is large. The effective sample size is roughly [math]\displaystyle{ P(B) }[/math] times the actual sample size, so if B is very unlikely, you'd need a very large sample size (see also Weighted Posterior Conditioning below).

Using a sampling distribution for a target distribution

Sometimes one needs to Monte-Carlo sample from a complicated probability density, f(x), either uni- or multi-variate, for which the density can be computed, but for which there is no easy way to generate random variates. A solution is to sample from a different distribution, g(x), but to weight the samples so that the results and statistics computed are those for f(x) rather than for the sampling density g(x).

Suppose [math]\displaystyle{ \theta[p] }[/math] is a statistic for a distribution [math]\displaystyle{ p(x) }[/math], and that [math]\displaystyle{ \hat\theta(x|w) }[/math] is an unbiased estimator of the statistic based on a weighted sample. Then if we generate a sample [math]\displaystyle{ x_i \sim g }[/math] and use a weight of [math]\displaystyle{ w(x_i)=f(x_i)/g(x_i) }[/math] for each sample point, [math]\displaystyle{ E[\hat\theta(x|w)]=\theta[f] }[/math]. In other words, even though we sample using g, our statistics computations are estimators for the statistics over f. Hence, in general, the sampling distribution does not need to be the same as the target distribution.
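
As a concrete sketch, suppose the target f is a standard normal but we sample from the wider g = Normal(0, 2); the densities are written out explicitly and the variable names are illustrative:

Variable X := Normal(0, 2) { sample from g = Normal(0, 2) }
Variable fDens := Exp(-X^2/2)/Sqrt(2*Pi) { target density f = Normal(0, 1) }
Variable gDens := Exp(-X^2/8)/(2*Sqrt(2*Pi)) { sampling density g }
Variable wt := fDens/gDens { importance weights w(x) = f(x)/g(x) }
Mean(X^2, w: wt) { estimates E_f[X^2] = 1 }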

The use of a sampling distribution with sample weighting is most useful when the sampling distribution is close to the target distribution. The more the two differ, the greater the sample size required for convergence to a good estimate of the statistic. In addition, the support of the sampling distribution must span the support of the target distribution -- g(x) must be non-zero wherever f(x) is non-zero, so that the Kullback-Leibler divergence [math]\displaystyle{ D_{KL}(f\|g) }[/math] remains finite.

Focus on critical regions

In some models, certain "regions" in the space of uncertain outcomes are more critical to the bottom line than others. For example, the 2% of days in which the stock market crashes or rallies may be of more importance than the other 98% of days. A utility function may vary dramatically in a small region, but exhibit little variation elsewhere. Or the behavior in the tails -- very rare events -- may be of particular interest. In these situations, we prefer increased coverage of the critical region in the generated samples, but we don't want the distorted sample to alter the expectations and other statistics computed by the model.

This is a special but important case of using a sampling distribution for a target distribution. Here the difficulty is not in generating samples from the true distribution; it is simply the desire to get more coverage in certain regions. One method is to sample from the actual distribution and also from the critical region alone, using the critical-region sample with probability p, so that your sampling distribution is a mixture of the true distribution and the critical-region distribution. The weight

[math]\displaystyle{ w(x_i) = {{f(x_i)} \over {(1-p) f(x_i) + p cr(x_i)}} }[/math]

where [math]\displaystyle{ cr(x_i) }[/math] is the density used for sampling the critical region, recovers the statistics for the target distribution.
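
A hedged sketch of this scheme, taking the target as Normal(0, 1) and the critical region as the upper tail X > 2 (this assumes Truncate and CumNormal to build the tail distribution; the names are illustrative):

Variable p := 0.3 { probability of drawing from the critical region }
Variable tailPr := 1 - CumNormal(2) { P(X > 2) under the target }
Chance X := If Bernoulli(p) Then Truncate(Normal(0, 1), 2) Else Normal(0, 1)
Variable fDens := Exp(-X^2/2)/Sqrt(2*Pi) { target density f }
Variable crDens := If X > 2 Then fDens/tailPr Else 0 { critical-region density cr }
Variable wt := fDens/((1 - p)*fDens + p*crDens)
Probability(X > 2, w: wt) { recovers P(X > 2) ≈ 2.3% under f }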

Weighted Posterior Conditioning

Simple posterior conditioning, mentioned above, has the disadvantage of performing very poorly when the condition/observation has a low probability of occurring relative to the prior. Unfortunately, this is the "normal" case in most Bayesian models. However, importance weighting can help here in some cases as well.

If you can approximate the posterior distribution through other techniques, then importance weighting can be used to adjust the final statistics to reflect the actual posterior. Again, this is an example of using a sampling distribution for a target distribution. Notably, the weight factors can be scaled by a constant factor without affecting the results. This is important for posterior computations, since the normalization factor is often very hard to compute, while a proportional density is relatively easy to come by.
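
For example, in a conjugate setting where the exact answer is known, a sketch of likelihood weighting -- sampling from the prior and weighting by an unnormalized likelihood:

Chance Theta := Uniform(0, 1) { prior }
Variable Lik := Theta^7 * (1 - Theta)^3 { likelihood of 7 successes in 10 trials, unnormalized }
Mean(Theta, w: Lik) { posterior mean; the exact answer is 8/12 ≈ 0.667 }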

Weighted Statistic Functions

All of Analytica's built-in statistical functions accept an optional weight parameter named «w». The weighted version of a statistic can be obtained by supplying the sample weights through this parameter. For example,

Mean(y, w: x > z)

would compute the posterior expected value of «y» given that «x» is greater than «z». Or

SDeviation(x, w: targetDensity/sampleDensity)

would compute the standard deviation of «x» under the target density, given that the model sampled from the sampling density, where the variable targetDensity contains the target's joint probability density for each joint sample in the model, and the variable sampleDensity contains the joint density that the samples were actually generated from.

When the weight parameter is specified with the intention of a non-uniform weighting, the weighting needs to be indexed by the running index. The weighted versions of all statistics can also be used on historical data by specifying a different running index, e.g., Correlation(x, y, I, w: myWt).

Global Importance Weighting

The special system variable SampleWeighting contains the global weighting used by default for uncertain samples. SampleWeighting is 1.0 by default -- equivalent to Array(Run, 1) -- applying an equal weighting to all sample points. By setting the definition of SampleWeighting to your own expression, you can control the weighting used by all statistical functions, including the result windows. A global weighting thus makes the use of a weighting transparent to the rest of the model. However, one must remain aware that the chance variable definitions then contain the sampling distributions, not the target distributions.

The SampleWeighting value is used as the default for the «w» parameter of every statistical function when the running index is Run. If the running index is anything other than Run, a constant weight is used by default. If a weighting is explicitly specified for the optional «w» parameter, that weighting is used to compute the statistic and SampleWeighting has no effect.

The system variable SampleWeighting can contain indexes other than Run. When this occurs, these indexes appear in every statistical result, even when they don't appear in any of the parameters (because they implicitly appear in the «w» parameter via its default). This means they also appear in statistical results in a result window, even though they may not appear in the sample. This can be useful. For example, you could have a decision variable defined as a choice with two values, ["Prior", "Posterior"]; every result view would then contain both the prior and posterior values.
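
A simplified sketch of that idea, using a plain index in place of the choice-valued decision (B is the hypothetical observation indicator from the posterior conditioning example above):

Index WeightingView := ['Prior', 'Posterior']
{ Definition of SampleWeighting: }
If WeightingView = 'Posterior' Then B Else 1
{ SampleWeighting now carries the WeightingView index, so every statistical result shows the prior and posterior values side by side. }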

Graphing Importance Weights in Scatter Plots

When you graph the Sample result view as a scatter plot, you may wish to use the size of the symbol to indicate the importance weight of the point. This can be done by adding SampleWeighting as an exogenous variable (by clicking the XY button at the top-right), enabling the "Symbol Size" role in the Graph Settings → Key panel, and then setting the Symbol Size role pulldown to SampleWeighting.

Note: It might make sense to make this the default Symbol Size role for scatter plots whenever the common index is Run.

Setting the SampleWeighting

The definition of the SampleWeighting system variable can be set from its object window. To get to the object window, de-select all nodes (e.g., by clicking in the background of the diagram) and on the menus navigate to Definition → System Variables → SampleWeighting.

Note: Unlike Time, which has a high-level "Edit Time" item on the menus, SampleWeighting is intentionally kept less accessible, so as not to burden typical users, who are never expected to use this feature.

Function Reference

Mean(x, i, w)

See Mean.

Variance(x, i, w)

See Variance.

SDeviation(x, i, w)

Computes the weighted sample standard deviation -- the square root of the Variance.
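
That is, for any running index and weighting, the following identity holds:

SDeviation(x, i, w) = Sqrt(Variance(x, i, w))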

See SDeviation and Variance.

Skewness(x, i, w)

See Skewness.

Kurtosis(x, i, w)

Computes an estimate of the weighted kurtosis, a measure of the peakedness of a distribution relative to the weight of its tails. A normal distribution has zero kurtosis (by this excess-kurtosis convention). A distribution with tails lighter than a normal, such as the uniform distribution, has negative kurtosis, while a heavier-tailed distribution has positive kurtosis.

[math]\displaystyle{ {{\sum_i w_i \left({{x_i-\bar{x}}\over\sigma}\right)^4} \over {\sum_i w_i}} - 3 }[/math]

If «x» contains one or more infinite values, the kurtosis is -INF, unless the values are constant at INF (or -INF), in which case it is NaN.
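
As a check, the formula above can be computed manually; a sketch using hypothetical weights wt:

Variable xs := Sample(X) { the sample for X, indexed by Run }
Variable xb := Mean(X, w: wt) { weighted mean }
Variable s := SDeviation(X, w: wt) { weighted standard deviation }
Sum(wt*((xs - xb)/s)^4, Run)/Sum(wt, Run) - 3 { matches Kurtosis(X, w: wt) }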

GetFract(x, p, i, w, discrete)

See GetFract.

Probability(b; I: optional IndexType = Run; w)

See Probability.

Frequency(x, a, i, w)

Frequency returns a count or histogram with the number of occurrences of each value of index «a» in «x», with the result indexed by «a». It works whether «x» and «a» contain numeric or text values. If «a» contains numbers in ascending order, it returns the number of values in «x» that are less than or equal to each value of «a» and greater than the previous value of «a». If you don't specify index «i», Frequency evaluates «x» as a probability distribution and computes the frequency over index Run. Otherwise, you can specify a different index «i» of «x» over which to count how often each value of «a» occurs in «x».

If you specify a weight «w» for each value of Run (or «i»), it returns the weighted count. With the default value of 1 for the system variable SampleWeighting, Frequency returns the count of points, which is generally larger than 1. If you want the relative frequency of points in the sample, you can divide by Sum(SampleWeighting, Run). If you want the frequency relative to the values in «a», you can divide the result by the result summed over «a».
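
For example (X and A are hypothetical):

Frequency(X, A)/Sum(SampleWeighting, Run) { relative frequency of points in the sample }
Frequency(X, A)/Sum(Frequency(X, A), A) { frequency relative to the values in A }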

You can also use Frequency to efficiently aggregate an array from a detailed index to a less detailed index. For example, if Revenue is indexed by Month, and you wish to aggregate (by summing) to Year:

Frequency(X: MonthToYear, A: Year, I: Month, w: Revenue)

This is equivalent to:

Aggregate(Revenue, MonthToYear, Month, Year)

where MonthToYear is an array, indexed by Month, having the value of Year in each cell. An equivalent expression would be

Sum((MonthToYear = Year)*Revenue, Month)

but notice that this third method generates an intermediate value, MonthToYear = Year, that is indexed by Month and Year. It has a complexity of [math]\displaystyle{ O( |Month| \times |Year|) }[/math], while the Frequency method (and Aggregate) has a complexity of [math]\displaystyle{ O( |Month| ) }[/math]. Note: |Year| doesn't appear since the associative lookup uses an [math]\displaystyle{ O(1) }[/math] hash-table based lookup.

Correlation

See Correlation.

Rank Correlation

See RankCorrel.

Pdf

See Cdf and Pdf Functions.

Cdf

See Cdf and Pdf Functions.

ProbBands(x, I, w, discrete)

Computes a weighted probability bands result. The result of this function appears on a probability bands result view.

The percentiles returned are selected in the Uncertainty Settings dialog. If the function call appears in the definition of variable Va1, then the uncertainty settings for Va1 are used if they have been set. If the ProbBands call occurs in a user-defined function, and that function is called from Va1, the default setting is used. If it is called from a button script, or if the variable whose definition contains the call does not have a local setting specified, the global default Uncertainty Settings are used.

The result is indexed by a local "magic" index named Probability. The number of elements in this index may vary, and may change as the user changes the Uncertainty Settings. For this reason, it is generally better to avoid using this function and to use GetFract instead. In fact, ProbBands is almost identical to GetFract, the only difference being that you specify the desired fractiles when calling GetFract, while ProbBands uses the UI settings.

See the description of GetFract for more details about the distinction and treatment of discrete versus continuous samples.

Statistics(x, I, w)

Computes a set of weighted statistics for «x». This is the result that appears in the Statistics view of a Result window.

The statistics shown are selected on the Statistics tab of the Uncertainty Settings dialog.

If the call to Statistics appears in the definition of Variable Va1, then Va1's local uncertainty settings are used if they are set. In all other cases, the global default Uncertainty Settings are used to select the statistics.

Your expression should never assume that any particular statistic will be present, since changes to the uncertainty settings will change the result. It is generally better to use the individual statistics functions described elsewhere on this page in an expression.

The optional domain parameter is relevant only to the median statistic. See the description for GetFract for more details. It would be highly unusual to explicitly set that parameter.

The Min and Max statistics, if they appear, are computed using CondMin(x, I, w) and CondMax(X, I, w), so that any points having a zero weight are not included in the Min or Max.

Min, Max, CondMin, CondMax

Min(x, I)
CondMin(x, I, b)

These functions are not statistical functions -- their first parameters are always evaluated in context-mode, and the condition to CondMin and CondMax (the equivalent of the weight) does not default to SampleWeighting.

However, when Min and Max are shown on the Statistics view in the Result window, what is shown is the conditional min and max (i.e., CondMin, CondMax), using SampleWeighting > 0 as the condition. This means that points with zero weight are not included in the Min or Max.
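
That is, the Statistics view effectively computes:

CondMin(X, Run, SampleWeighting > 0) { smallest sample value with non-zero weight }
CondMax(X, Run, SampleWeighting > 0) { largest sample value with non-zero weight }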
