Statistical Functions and Importance Weighting

 
[[Category:Doc Status C]] <!-- For Lumina use, do not change -->
 
<div style="column-count:2;-moz-column-count:2;-webkit-column-count:2">
__TOC__
</div>

A statistical function computes a quantity that summarizes some aspect of a sample of data.  These are reduction functions, in that the running dimension is not present in the result; however, in some cases, other dimensions may be introduced.  Many of these functions are used by the Analytica result views, and hence the detailed descriptions given here apply to these result views as well.
  
 
Statistical functions include:  [[Mean]], [[SDeviation]], [[Variance]], [[Skewness]], [[Kurtosis]], [[GetFract]], [[Correlation]], [[RankCorrel]], [[Frequency]], [[Probability]], [[ProbBands]], [[Statistics]], [[PDF]] and [[CDF]].   
 
 
Concepts common to all statistical functions are covered initially, then each function is described with greater precision.
 
  
== The Run Index ==
 
 
Statistical functions operate over a Running dimension, where each element of the dimension corresponds to one data point.  A data set may contain other dimensions, but these will simply be "array abstracted", with the statistical function applied independently to each slice along the extra dimensions.
 
 
 
By default, statistical functions use the special system index named [[Run]] as the running index.  So, for example, the expression <code>[[SDeviation]](x)</code> would evaluate «x» in sample-mode (see the following section) and compute the standard deviation of the result over the [[Run]] dimension.
 
 
 
Models often contain uncertain quantities, defined by distribution functions inside chance variable definitions. The default use of the [[Run]] dimension implies that the statistical functions are being applied to the uncertainty that has been modeled.
 
 
 
In some cases, a data set may come from an external source, such as a collection of historical data.  In this case, the running index (i.e., the dimension indexing the individual points in the historical data) would be a user-defined index.  Statistical functions accept an optional parameter, «i», to identify a running index other than [[Run]].  Such a call might look like: <code>[[SDeviation]](x, I)</code>.
 
  
Analytica represents any uncertain quantity as a random sample from a probability distribution over the built-in index [[Run]], which goes from 1 to [[SampleSize]]. By default, statistical functions force their main parameter «x» to be evaluated as a probability distribution, and they operate over [[Run]]. For example, <code>Mean(X)</code> returns the mean estimated from the random sample for <code>X</code>.
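For intuition, the same estimate can be sketched outside Analytica. This Python fragment (an illustration of the underlying Monte Carlo computation, not Analytica syntax; the distribution and sample size are arbitrary choices) builds a random sample that plays the role of the [[Run]] dimension and averages over it:

```python
import random

random.seed(0)
sample_size = 100_000          # plays the role of SampleSize

# A random sample over the "Run" dimension for an uncertain quantity:
# X ~ Uniform(0, 10), whose true mean is 5.
x_sample = [random.uniform(0.0, 10.0) for _ in range(sample_size)]

# Mean(X): the mean estimated from the random sample.
mean_x = sum(x_sample) / len(x_sample)
```

With 100,000 sample points the estimate lands close to the true mean of 5; smaller sample sizes give noisier estimates.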
  
You can also apply any statistical function over another index, if you specify it explicitly, for example, <code>Mean(X, i)</code> computes the mean value of array <code>X</code> over index <code>i</code>. If you specify an index other than [[Run]], a statistical function evaluates its main parameter <code>X</code>  as deterministic or probabilistic according to the context mode in which it is being evaluated (see the next section).
  
== Evaluation Mode ==
When Analytica evaluates an expression, it is always done within an [[Evaluation Modes|evaluation mode]], either mid-mode or sample-mode.  The result computed by an expression may differ in the two modes if it contains uncertainty.   
  
If you view a result with [[Mid]] selected in the Analytica result window, the variable being viewed is evaluated in mid-mode.  If you select any of the other result views -- [[Mean]], [[Statistics]], [[ProbBands|Bands]], [[PDF]], Prob Mass, [[CDF]], Cum Mass, or [[Sample]] -- the variable is evaluated in sample-mode.
  
When most Analytica function calls are evaluated, the expressions provided as parameters are evaluated in the same evaluation mode as the call itself.  This is referred to as context-mode, and it is the default.  However, some functions alter the evaluation mode when their parameters are evaluated.  For example, the function <code>[[Sample]](x)</code> always evaluates «x» in sample-mode, even when the function is being evaluated in mid-mode.  The <code>[[Mid]](x)</code> function does the opposite -- its parameter is always evaluated in mid-mode, even if the call is evaluated from sample-mode.   
  
When the running index to a statistical function is [[Run]] (either because [[Run]] is explicitly specified, or because the running index parameter was not specified), the data parameter(s) are always evaluated in sample-mode.  However, if a running index other than [[Run]] is specified, then the data parameter(s) are evaluated in context-mode.  So, the evaluation mode used by statistical functions is actually conditional on the running index.
  
 
When a function's parameters are declared, the following parameter qualifiers control which evaluation mode is used when the parameter is evaluated:
 
  
 
===Context===
 
The default if no evaluation-mode qualifier is specified.  The parameter is evaluated in the current evaluation mode. See [[Function_Parameter_Qualifiers#Context|Context]]. Example declaration:
  
:<code>MyFunction(x: Context)</code>
  
 
=== Determ ===
 
The parameter is evaluated in [[mid]]-mode.  See [[Function_Parameter_Qualifiers#Determ|Determ]]. Example declaration:
  
:<code>MyFunction(x: Determ)</code>
 
 
 
  
 
===Samp===  
 
The parameter is evaluated in [[sample]]-mode.  If the result is not indexed by [[Run]], an error message is issued.
 
 
Example usages:
 
  
:<code>MyFunction(x: Sample)</code>
:<code>MyFunction(x: Sample Array[Run])</code>
  
In the second declaration, the dimensions of «x» are declared.  When the [[Function_Parameter_Qualifiers#Sample|Sample]] qualifier is used with dimensions, you will most often desire «x» to have the [[Run]] dimension inside the function, and therefore, you need to include the [[Run]] dimension in the list of dimensions.  If you don't include it, as in <code>x: Sample Array[I]</code> or <code>x: Sample Array[]</code>, Analytica will iterate over the [[Run]] dimension and supply your function with one sample point at a time.
 
 
 
  
 
=== Prob ===  
 
 
The parameter is evaluated in sample-mode.
 
 
Example usages:
 
:<code>MyFunction(x: Prob)</code>
:<code>MyFunction(x: Prob Array[I, Run]; I: IndexType)</code>
:<code>MyFunction(x: Prob Array All[Run])</code>
  
When a [[Function_Parameter_Qualifiers#Prob|Prob]] parameter is evaluated, no error is issued if the result doesn't contain the [[Run]] dimension.  If dimensions are explicitly declared, as with the second and third examples, you must include the [[Run]] dimension in the list if you wish «x» to contain it when your function definition is evaluated.  Since «x» might not contain the [[Run]] dimension when evaluated, the [[Function_Parameter_Qualifiers#All|All]] qualifier ensures it is there if you rely on it -- the value of «x» will be constant over that dimension if the result of evaluating «x» did not include the [[Run]] dimension.
 
 
 
  
 
=== ContextSamp ===
 
A parameter declared with the [[Function_Parameter_Qualifiers#ContextSamp|ContextSamp]] qualifier is treated differently depending on whether or not its declaration includes the [[Run]] dimension.  If the [[Run]] dimension is present, the parameter is evaluated in sample-mode, but if it is not present, the parameter is evaluated in context mode.  In addition, a [[Function_Parameter_Qualifiers#ContextSamp|ContextSamp]] parameter is treated differently if the unevaluated expression in a function call is the special system variable [[SampleWeighting]].  In this special case, the parameter is evaluated only if the [[Run]] dimension is declared for the parameter (yielding the value of [[SampleWeighting]]), otherwise, the parameter is not evaluated and a constant value of 1 is used.  Example declaration:
  
:<code>MyStat(x: ContextSamp[I]; I: Index = Run)</code>
  
In this example, if the function is called as <code>MyStat(A)</code> or <code>MyStat(A, Run)</code>, the declaration of «x» is considered to have the [[Run]] dimension explicitly specified, so «A» is evaluated in sample-mode.  However, when called as <code>MyStat(A, J)</code>, «A» is evaluated in context -- either mid-mode or sample-mode depending on the current context.  The usage <code>MyStat(SampleWeighting, Run)</code> would use the value of the system variable [[SampleWeighting]] for «x», while <code>MyStat(SampleWeighting, J)</code> would use 1 as the first parameter value.
  
The special behavior of the [[Function_Parameter_Qualifiers#ContextSamp|ContextSamp]] qualifier when passed [[SampleWeighting]] is used by functions that are capable of computing weighted statistics (which includes all of the built-in statistics described here).
 
 
== Importance Weighting ==
 
By default, statistical functions treat each sample point equally, i.e., with equal weighting.  However, if desired, you can assign a different weight to each sample point, which the statistical functions will utilize.  When a non-uniform weighting is present, this results in a weighted-statistic, such as a weighted-mean, weighted-variance, etc.   
 
  
 
Importance weighting can be used advantageously in several ways.   
 
  
=== Example Uses of Importance Weighting ===

==== Posterior Conditioning ====
Using weights of 0 and 1 provides an elementary way to compute posterior probabilities.  You may have several chance variables in your model, and an expression that determines whether the combination of values is possible or impossible (i.e., possibly comparing it to an observation that has been provided).  The expression evaluates to 0 or 1 for each sample, 0 being impossible and 1 being possible.  Using this expression as the weight, the statistical functions return the statistics on the posterior.  For example, if we let <code>B</code> denote this expression, [[Mean]](x) would compute ''E[x|B]''.  [[PDF]](x) would show ''p(x|B)''.  And so on.
  
This form of posterior conditioning is useful when <math>P(B)</math>, the probability that a sample from the prior distribution is "possible", is large.  The effective sample size is essentially reduced by a <math>P(\neg B)</math> proportion, so if <code>B</code> is very unlikely, you'd need a very large sample size (see also weighted posterior conditioning).
  
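The 0/1-weighting scheme can be sketched in plain Python (a hand-rolled illustration of the math, not Analytica syntax; the prior and the condition <code>B</code> are arbitrary choices). A weighted mean with indicator weights estimates ''E[x|B]'':

```python
import random

random.seed(0)
n = 100_000

# Prior: x ~ Uniform(0, 1).  Condition B: x > 0.5.
xs = [random.random() for _ in range(n)]
ws = [1.0 if x > 0.5 else 0.0 for x in xs]   # 0 = impossible, 1 = possible

# Weighted mean = sum(w*x)/sum(w), an estimate of E[x | B] = 0.75.
posterior_mean = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
```

Note that roughly half the sample points receive weight 0 here, which is exactly the effective-sample-size reduction described above.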
==== Using a sampling distribution for a target distribution ====
Sometimes one needs to Monte-Carlo sample from a complicated probability density, ''f(x)'', either uni- or multi-variate, for which the density can be computed, but for which there is no easy way to generate random variates.  A solution is to sample from a different distribution, ''g(x)'', but to weight the samples so that the results and statistics computed are those for ''f(x)'' rather than for the sampling density ''g(x)''.
  
Suppose <math>\theta[p]</math> is a statistic for a distribution <math>p(x)</math>, and that <math>\hat\theta(x|w)</math> is an unbiased estimator of the statistic based on a weighted sample.  Then if we generate a sample <math>x_i \sim g</math> and use a weight of <math>w(x_i)=f(x_i)/g(x_i)</math> for each sample, <math>E[\hat\theta(x|w)]=\theta[f]</math>.  In other words, even though we sample using ''g'', our statistics computations are estimators for the statistics over ''f''.  Hence, in general, the sampling distribution does not need to be the same as the target distribution.
 
 
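As a concrete check of this weighting scheme, the following Python fragment (an illustration of the math, not Analytica syntax; the particular distributions are arbitrary choices) samples from ''g'' = Normal(0, 2) and recovers the mean of the target ''f'' = Normal(2, 1):

```python
import math
import random

random.seed(1)
n = 200_000

# Target density f: Normal(2, 1); sampling density g: Normal(0, 2).
def f(x): return math.exp(-(x - 2.0) ** 2 / 2.0) / math.sqrt(2 * math.pi)
def g(x): return math.exp(-x ** 2 / 8.0) / math.sqrt(8 * math.pi)

xs = [random.gauss(0.0, 2.0) for _ in range(n)]   # sample x_i ~ g
ws = [f(x) / g(x) for x in xs]                    # weights w_i = f(x_i)/g(x_i)

# Weighted mean estimates E_f[x] = 2 even though the samples came from g.
mean_f = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
```

This works well here because ''g'' is wider than ''f'' and covers its support; a narrow ''g'' would produce extreme weights and slow convergence, as noted below.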
 
  
 
The use of a sampling distribution with sample weighting is most useful when the sampling distribution is close to the target distribution.  The more different they are, the greater the sample size required for convergence to a good estimate of the statistic.  In addition, the support of the sampling distribution must span the support of the target distribution (as captured by various mathematical conditions, such as a finite Kullback-Leibler divergence, or absolute continuity of the target with respect to the sampling distribution).
  
==== Focus on critical regions ====
 
 
 
In some models, certain "regions" in the space of uncertain outcomes are more critical to the bottom line than others.  For example, the 2% of days in which the stock market crashes or rallies may be of more importance than the other 98% of days.  A utility function may vary dramatically in a small region, but exhibit little variation elsewhere.  Or the action that occurs in the tails, very-rare events, is of particular interest.  In these situations, we prefer to have an increased coverage of the critical region in the generated samples, but we don't want the distorted sample to alter the actual expectations and other statistics computed by the model.
 
  
 
This is a special but important case of using a sampling distribution for a target distribution.  In this case, the difficulty is not in generating the sample from the true distribution, it is simply the desire to get more coverage in certain regions.  One method that can be used is to sample from the actual distribution, then sample also from only the critical region, and then use the critical region sample with probability p, so that your sampling distribution is a mixture of the true distribution and the critical region.  The weight  
 
  
:<math>w(x_i) = {{f(x_i)} \over {(1-p) f(x_i) + p cr(x_i)}}</math>
  
 
where <math>cr(x_i)</math> is the density used for sampling the critical region and ''p'' is the mixture probability, recovers the statistics for the target distribution.
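A minimal Python sketch of this mixture scheme (illustrative only, not Analytica syntax; the target, critical region, and ''p'' are arbitrary choices) shows that the weight above recovers the target statistics despite heavy oversampling of the critical region:

```python
import random

random.seed(3)
n, p = 100_000, 0.5

# Target f: Uniform(0, 1); critical-region density cr: Uniform(0.9, 1).
def f_pdf(x):  return 1.0 if 0.0 <= x <= 1.0 else 0.0
def cr_pdf(x): return 10.0 if 0.9 <= x <= 1.0 else 0.0

xs, ws = [], []
for _ in range(n):
    # Draw from the mixture (1-p)*f + p*cr to boost tail coverage.
    if random.random() < p:
        x = random.uniform(0.9, 1.0)
    else:
        x = random.uniform(0.0, 1.0)
    xs.append(x)
    # Mixture importance weight w = f / ((1-p) f + p cr).
    ws.append(f_pdf(x) / ((1 - p) * f_pdf(x) + p * cr_pdf(x)))

# The weighted mean recovers E_f[x] = 0.5 despite oversampling [0.9, 1].
mean_est = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
```

Roughly half the sample points land in [0.9, 1], ten times their share under the target, yet the down-weighting cancels the distortion.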
  
==== Weighted Posterior Conditioning ====
 
Simple posterior conditioning, mentioned above, has the disadvantage of performing very poorly when the condition/observation has a low probability of occurring relative to the prior.  Unfortunately, this is actually the "normal" case in most Bayesian models.  However, importance weighting can help with the solution here in some cases as well.
 
  
 
If you can approximate the posterior distribution through other techniques, then importance weighting can be used to tweak the final statistics to reflect the actual posterior.  Again, this is an example of using a sampling distribution for a target distribution.  It is notable that the weight factors can be changed by a constant factor without impacting the results.  With posterior computations this is important, since the normalization factor is often very hard to compute, but a proportional density is relatively easy to come by.
 
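The invariance to a constant factor is easy to verify with a weighted mean; a small Python check (illustrative only, not Analytica syntax):

```python
def weighted_mean(xs, ws):
    """Weighted mean: sum(w*x) / sum(w)."""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

xs = [1.0, 2.0, 5.0]
ws = [0.2, 0.3, 0.5]      # e.g., an unnormalized posterior density

a = weighted_mean(xs, ws)
b = weighted_mean(xs, [1000.0 * w for w in ws])   # rescaled by a constant

# a and b agree (up to floating point): the constant cancels in
# the ratio, which is why the normalization factor never matters.
```

The same cancellation applies to any statistic of the form weighted-sum over weighted-sum, which covers the built-in weighted statistics.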
  
=== Weighted Statistic Functions ===
All Analytica's built-in statistical functions accept an optional weight parameter named «w».  The weighted-version of the statistic can be obtained by supplying the sample weights through this parameter.  For example,
  
:<code>Mean(y, w: x > z)</code>
  
would compute the posterior expected value of «y» given that «x» is greater than «z». Or
  
:<code>SDeviation(x, w: targetDensity/sampleDensity)</code>
  
would compute the standard deviation of «x» in the target density, given that the model sampled from the sample density, where the variable <code>targetDensity</code> contains the joint probability density for each joint sample in the model, and the variable <code>sampleDensity</code> contains the joint density that samples were generated from.
  
When the weight parameter is specified, and the intention is to have a non-uniform weighting, the weighting needs to be indexed by the running index.  The weighted versions of all statistics can be used on historical data as well by specifying a different running index, e.g., <code>Correlation(x, y, I, w: myWt)</code>.
  
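The computation behind such weighted statistics can be sketched in Python (one common convention, not Analytica syntax; the built-in functions may differ in details such as bias correction):

```python
import math

# Hypothetical data over the running index, with per-point weights.
x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 1.0, 2.0]    # a zero weight drops that point entirely

wsum = sum(w)
mean = sum(wi * xi for wi, xi in zip(w, x)) / wsum   # weighted mean

# Weighted variance about the weighted mean (population-style, no
# bias correction; Analytica's SDeviation may normalize differently).
var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x)) / wsum
sd = math.sqrt(var)
```

Here the first point is excluded by its zero weight and the last counts double, so the weighted mean is pulled toward 4.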
=== Global Importance Weighting ===
The special system variable [[SampleWeighting]] contains the global weighting used by default for uncertain samples.  [[SampleWeighting]] is by default 1.0 -- which is equivalent to <code>Array(Run, 1)</code> -- applying an equal weighting to all samples.  However, by setting the definition of [[SampleWeighting]] to your own expression, you can control the weighting used by all statistical functions, including the result windows.  A global weighting thus makes the fact that a weighting is being used transparent to the rest of the model.  However, one must remain aware that chance variable definitions then contain the sampling distributions and not the target distributions.
  
The [[SampleWeighting]] value is used as the default weight parameter to all statistical functions when the running index is [[Run]].  If the running index is anything other than [[Run]], a constant weight is used by default, or if a weighting is explicitly specified for the optional «w» parameter, that weighting is used to compute the statistic and [[SampleWeighting]] has no effect.
  
The system variable [[SampleWeighting]] can contain indexes other than [[Run]].  When this occurs, these indexes will appear in every statistical result, even when they don't appear in any of the parameters (because they implicitly appear in the «w» parameter via its default).  This means they will also appear in statistical results in a result window, even though they may not appear in the sample.  But this can be useful.  For example, you could have a decision variable defined as a choice with two values, <code>["Prior", "Posterior"]</code>; every result view would then contain both the prior and posterior values.
  
=== Graphing Importance Weights in Scatter Plots ===
  
When you graph the [[Sample]] result view as a scatter plot, you may wish to use the size of the symbol to indicate the importance weight of the point.  This can be done by adding [[SampleWeighting]] as an exogenous variable (by clicking the '''XY''' button at the top-right), enabling the "Symbol Size" role in the ''Graph Settings &rarr; Key'' panel, and then setting the Symbol Size role pulldown to [[SampleWeighting]].
  
== Graphing Importance Weights in Scatter Plots ==
+
Note: It might make sense to make this the default Symbol Size role always for scatter plots when the common index is [[Run]].
  
When you graph the Sample result view as a scatter plot, you may wish to use the size of the symbol to indicate the importance weight of the point. This can be done by adding SampleWeighting as an exogenous variable (by clicking the XY button at the top-right), enabling the "Symbol Size" role in the Graph Settings -> Key panel, and then setting the Symbol Size role pulldown to SampleWeighting.
+
=== Setting the SampleWeighting ===
 +
The definition of the [[SampleWeighting]] system variable can be set from its object window.  To get to the object window, de-select all nodes (e.g., by clicking in the background of the diagram) and on the menus navigate to ''Definition &rarr; System Variables &rarr; SampleWeighting''.
  
Note: It might make sense to make this the default Symbol Size role always for scatter plots when the common index is Run.
+
Note: Unlike [[Time]], which has a high-level "Edit Time" on the menus, [[SampleWeighting]] is intentionally kept less accessible, so as to not burden the normal user who is expected to never use the feature.
 
 
== Setting the SampleWeighting ==
 
 
 
The definition of the SampleWeighting system variable can be set from its object window.  To get to the object window, de-select all nodes (e.g., by clicking in the background of the diagram) and on the menus navigate to Definition -> System Variables -> SampleWeighting.
 
 
 
Note: Unlike Time, which has a high-level "Edit Time" on the menus, SampleWeighting is intentionally kept less accessible, so as to not burden the normal user who is expected to never use the feature.
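To see what a non-uniform global weighting accomplishes, here is a small Python sketch (illustrative only, not Analytica code; the density helper and distribution choices are hypothetical). It shows the idea behind setting a weighting of <code>targetDensity/sampleDensity</code>: sampling from one distribution while weighting each point by target density over sampling density recovers statistics of the target distribution.

```python
import math
import random

random.seed(1)

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Sample from g = Normal(0, 2), but weight each point by f/g, where
# f = Normal(3, 1) is the target distribution (the analog of setting
# SampleWeighting to targetDensity / sampleDensity).
n = 200_000
xs = [random.gauss(0, 2) for _ in range(n)]
ws = [normal_pdf(x, 3, 1) / normal_pdf(x, 0, 2) for x in xs]

unweighted_mean = sum(xs) / n                                  # near 0, the sampling mean
weighted_mean = sum(w * x for w, x in zip(ws, xs)) / sum(ws)   # near 3, the target mean
print(round(unweighted_mean, 1), round(weighted_mean, 1))
```

Because the weights enter only as a ratio against their own sum, rescaling all weights by a constant does not change the result, which is why the weighting only needs to be proportional to the target density.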
 
 
 
== Function Reference ==

=== <div id="Mean">Mean(x'', i, w'')</div> ===

See [[Mean]].
  
=== <div id="Variance">Variance(x'', i, w'')</div> ===

See [[Variance]].
  
=== <div id="SDeviation">SDeviation(x'', i, w'')</div> ===

Computes the weighted sample standard deviation -- the square root of the [[Variance]].

See [[SDeviation]] and [[Variance]].
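As a reference for how the «w» parameter enters these statistics, here is a minimal Python sketch of the weighted mean and weighted standard deviation (an illustration of the weighted formulas only, not Analytica's implementation, which may additionally apply a sample-size correction):

```python
def weighted_mean(xs, ws):
    # Sum(w * x) / Sum(w)
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def weighted_sdeviation(xs, ws):
    # Square root of the weighted variance about the weighted mean.
    m = weighted_mean(xs, ws)
    variance = sum(w * (x - m) ** 2 for w, x in zip(ws, xs)) / sum(ws)
    return variance ** 0.5

xs = [1, 2, 3, 4]
print(weighted_mean(xs, [1, 1, 1, 1]))  # 2.5 -- equal weights reduce to the plain mean
print(weighted_mean(xs, [0, 0, 1, 1]))  # 3.5 -- zero-weight points are effectively dropped
```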
  
=== <div id="Skewness">Skewness(x'', i, w'')</div> ===

See [[Skewness]].
  
=== <div id="Kurtosis">Kurtosis(x'', i, w'')</div> ===

Computes an estimate of the weighted kurtosis, a measure of the degree to which the distribution has a central peak.  A normal distribution has zero kurtosis.  A distribution with tails lighter than a normal, such as the uniform distribution, has a negative kurtosis.

:<math>\sum_i w_i \left({{x_i-\bar{x}}\over\sigma}\right)^4 / \sum_i w_i - 3</math>

If «x» contains one or more infinite values, the kurtosis is -[[INF]], unless the values are constant at [[INF]] (or -[[INF]]), in which case it is [[NaN]].
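The formula above can be checked with a short Python sketch (an illustration of the formula, not Analytica's implementation):

```python
def weighted_kurtosis(xs, ws):
    # Implements: Sum(w * ((x - mean)/sigma)^4) / Sum(w)  -  3
    sw = sum(ws)
    mean = sum(w * x for w, x in zip(ws, xs)) / sw
    sigma = (sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / sw) ** 0.5
    return sum(w * ((x - mean) / sigma) ** 4 for w, x in zip(ws, xs)) / sw - 3

# A flat, uniform-like sample has lighter tails than a normal, so its
# kurtosis comes out negative (about -1.2 for a uniform distribution).
flat = list(range(1, 101))
print(weighted_kurtosis(flat, [1] * 100))
```

Note that rescaling all the weights by a constant leaves the result unchanged, since the weights appear only relative to their sum.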
=== <div id="GetFract">GetFract(x, p'', i, w, discrete'')</div> ===

See [[GetFract]].
  
=== <div id="Probability">Probability(b''; I: optional IndexType = Run; w'')</div> ===

See [[Probability]].
=== <div id="Frequency">Frequency(x, a, ''i, w'')</div> ===

Frequency returns a count or histogram with the number of occurrences of each value of index «a» in «x», with the result indexed by «a».  It works whether «x» and «a» contain numeric or text values.  If «a» contains numbers in ascending order, it returns the number of values in «x» that are equal to or less than «a», and greater than the previous value of «a».  If you don't specify index «i», Frequency evaluates «x» as a probability distribution and computes the frequency over index Run.  Otherwise you can specify a different index «i» of «x» over which to count how often each «a» occurs in «x».

If you specify a weight «w» for each value of Run (or «i»), it returns the weighted count.  With the default value of 1 for the system variable [[SampleWeighting]], [[Frequency]] returns the count of points, which is generally larger than 1.  If you want the relative frequency of points in the sample, you can divide by <code>Sum(SampleWeights, Run)</code>.  If you want the frequency relative to those values in «a», you can divide the result by the result summed over «a».

You can also use Frequency to efficiently aggregate an array from a detailed index to a less detailed index.  For example, if <code>Revenue</code> is indexed by <code>Month</code>, and you wish to aggregate (by summing) to <code>Year</code>:

:<code>Frequency(X: MonthToYear, A: Year, I: Month, w: Revenue)</code>

This is equivalent to:

:<code>Aggregate(Revenue, MonthToYear, Month, Year)</code>

where <code>MonthToYear</code> is an array, indexed by <code>Month</code>, having the value of <code>Year</code> in each cell.  An equivalent expression would be

:<code>Sum((MonthToYear = Year)*Revenue, Month)</code>

but notice that this third method generates an intermediate value, <code>MonthToYear = Year</code>, that is indexed by <code>Month</code> and <code>Year</code>.  It has a complexity of <math>O( |Month| \times |Year|)</math>, while the [[Frequency]] method (and [[Aggregate]]) has a complexity of <math>O( |Month| )</math>.  ''Note'': <code>|Year|</code> doesn't appear since the associative lookup uses an <math>O(1)</math> hash-table based lookup.
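Outside of Analytica, the same one-pass aggregation idiom can be sketched in Python (hypothetical data; a dictionary plays the role of Frequency's hash-table lookup):

```python
# Revenue indexed by Month, and a Month -> Year mapping (like MonthToYear).
months = ["2023-01", "2023-02", "2024-01", "2024-02"]
revenue = {"2023-01": 10.0, "2023-02": 20.0, "2024-01": 30.0, "2024-02": 40.0}
month_to_year = {m: m[:4] for m in months}

# Analog of Frequency(X: MonthToYear, A: Year, I: Month, w: Revenue):
# tally each month's revenue into the bucket for its year.
# One pass over Month -- O(|Month|), as with Frequency/Aggregate.
by_year = {}
for m in months:
    y = month_to_year[m]
    by_year[y] = by_year.get(y, 0.0) + revenue[m]

print(by_year)  # {'2023': 30.0, '2024': 70.0}
```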
=== Correlation ===

See [[Correlation]].
=== Rank Correlation ===

See [[RankCorrel]].
=== Pdf ===

See [[Cdf and Pdf Functions]].
=== Cdf ===

See [[Cdf and Pdf Functions]].
=== <div id="ProbBands">ProbBands(x'', I, w, discrete'')</div> ===

Computes a weighted probability bands result.  The result of this function appears in a probability bands result view.

The percentiles returned are selected from the ''Uncertainty Dialog''.  If the function call appears in the definition of a variable <code>Va1</code>, then the uncertainty settings for <code>Va1</code> are used if they have been set.  If the [[ProbBands]] call occurs in a user-defined function, and that function is called from <code>Va1</code>, the default setting is used.  If it is called from a button script, or if the variable whose definition contains the call does not have a local setting specified, the global default ''Uncertainty Settings'' are used.

The result is indexed by a local "magic" index named [[Probability]].  The number of elements in this index may vary, and may change as the user changes the ''Uncertainty Settings''.  For this reason, it is generally better to avoid using this function, and to use the [[GetFract]] function instead.  In fact, [[ProbBands]] is almost identical to [[GetFract]], with the only difference being that you specify the desired fractiles when calling [[GetFract]], while [[ProbBands]] uses the UI settings.

See the description of [[GetFract]] for more details about the distinction and treatment of discrete versus continuous samples.
  
=== <div id="Statistics">Statistics(x'', I, w'')</div> ===

Computes a set of weighted statistics for «x».  This is the result that appears in the '''Statistics''' view of a '''Result''' window.

The statistics computed are selected from the Statistics tab of the ''Uncertainty Settings'' dialog.

If the call to [[Statistics]] appears in the definition of a variable <code>Va1</code>, then <code>Va1</code>'s local uncertainty settings are used if they are set.  In all other cases, the global default ''Uncertainty Settings'' are used to select the statistics.

Your expression should never assume that any particular statistic will be present, since changes to the uncertainty settings will change the result.  It is generally better to use the individual statistics functions described elsewhere on this page in an expression.

The optional «domain» parameter is relevant only to the median statistic.  See the description of [[GetFract]] for more details.  It would be highly unusual to set that parameter explicitly.

The [[Min]] and [[Max]] statistics, if they appear, are computed using <code>[[CondMin]](x, I, w)</code> and <code>[[CondMax]](x, I, w)</code>, so that any points having a zero weight are not included in the [[Min]] or [[Max]].
=== Min, Max, CondMin, CondMax ===

:<code>Min(x'', I'')</code>
:<code>CondMin(x'', I, b'')</code>

These functions are not statistical functions -- the first parameters are always evaluated in context mode, and the condition «b» to [[CondMin]] and [[CondMax]] (which is the equivalent of a weight) does not default to [[SampleWeighting]].

However, when [[Min]] and [[Max]] are shown in the '''Statistics''' view of the [[Result window]], what is shown is the conditional min and max (i.e., [[CondMin]], [[CondMax]]), using [[SampleWeighting]] > 0 as the condition.  This means that points with zero weight are not included in the [[Min]] or [[Max]].
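The conditional-min behavior described above can be sketched in Python (illustrative only, not Analytica's implementation):

```python
def cond_min(xs, cond):
    # Minimum over only those points where the condition holds, mirroring
    # how the Statistics view uses SampleWeighting > 0 as the condition.
    kept = [x for x, c in zip(xs, cond) if c]
    return min(kept) if kept else None

xs = [5, -2, 7, 1]
weights = [1, 0, 1, 1]
print(min(xs))                                  # -2: the plain min sees every point
print(cond_min(xs, [w > 0 for w in weights]))   # 1: the zero-weight point is excluded
```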
== See also ==

* [[Importance weights]]
* [[Importance analysis]]
* [[Tutorial: Analyzing a model#Importance_analysis|Tutorial: Importance analysis]]
* [[Statistics, Sensitivity, and Uncertainty Analysis]]
* [[Statistical functions]]
* [[Statistics]]
* [[Probability]]
* [[Function_Parameter_Qualifiers#Prob|Prob]]
* [[Frequency]]
* [[Uncertainty Setup dialog]]
* [[What's new in Analytica 4.0?]]
* [[Function calls and parameters]]

Revision as of 01:27, 25 May 2023





Global Importance Weighting

The special system variable, SampleWeighting, contains the global weighting used by default for uncertain samples. SampleWeighting is by default 1.0 -- which is equivalent to Array(Run, 1) -- applying an equal weighting to all samples. However, by setting the definition of SampleWeighting to your own expression, you can control the weighting used by all statistical functions, including by the result windows. A global weighting thus provides a fact that a weighting is being used transparent. However, one must remain aware that chance variable definitions contain the sample distributions and not the target distributions.

The SampleWeighting value is used as the default weight parameter to all statistical functions when the running index is Run. If the running index is anything other than Run, a constant weight is used by default, or if a weighting is explicitly specified for the optional «w» parameter, that weighting is used to compute the statistic and SampleWeighting has no effect.

The system variable SampleWeighting could contain indexes other than Run. When this occurs, these indexes will appear in every statistical result, even when those indexes don't appear in any of the parameters (i.e., because they implicitly appear in the w parameter via its default). This means they will also appear in statistical results in a result window, even though they may not appear in the sample. But, this could be useful. For example, you could have decision variable defined as a choice with two values: ["Prior", "Posterior"]. Every result view would therefore contain both the prior and posterior values.

Graphing Importance Weights in Scatter Plots

When you graph the Sample result view as a scatter plot, you may wish to use the size of the symbol to indicate the importance weight of the point. This can be done by adding SampleWeighting as an exogenous variable (by clicking the XY button at the top-right), enabling the "Symbol Size" role in the Graph Settings → Key panel, and then setting the Symbol Size role pulldown to SampleWeighting.

Note: It might make sense to make this the default Symbol Size role always for scatter plots when the common index is Run.

Setting the SampleWeighting

The definition of the SampleWeighting system variable can be set from its object window. To get to the object window, de-select all nodes (e.g., by clicking in the background of the diagram) and on the menus navigate to Definition → System Variables → SampleWeighting.

Note: Unlike Time, which has a high-level "Edit Time" option on the menus, SampleWeighting is intentionally kept less accessible, so as not to burden typical users, who are never expected to need this feature.

= Function Reference =

== Mean(x, i, w) ==

See [[Mean]].

== Variance(x, i, w) ==

See [[Variance]].

== SDeviation(x, i, w) ==

Computes the weighted sample standard deviation -- the square root of the Variance.

See [[SDeviation]] and [[Variance]].
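The relationship between the weighted variance and standard deviation can be sketched in Python. This is only an illustration of the square-root relationship, normalizing by the total weight; Analytica's actual estimator may apply a bias correction, so treat the normalization here as an assumption.

```python
import math

# A minimal sketch (not Analytica's implementation) of a weighted standard
# deviation as the square root of a weighted variance.
def weighted_sdev(x, w):
    sw = sum(w)                                                # total weight
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sw           # weighted mean
    var = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w)) / sw
    return math.sqrt(var)
```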

== Skewness(x, i, w) ==

See [[Skewness]].

== Kurtosis(x, i, w) ==

Computes an estimate of the weighted kurtosis, a measure of the degree to which the distribution has a central peak. A normal distribution has zero kurtosis. A distribution with tails lighter than a normal, such as the uniform distribution, has negative kurtosis, while a heavier-tailed distribution has positive kurtosis.

[math]\displaystyle{ \frac{\sum_i w_i \left(\frac{x_i-\bar{x}}{\sigma}\right)^4}{\sum_i w_i} - 3 }[/math]

If «x» contains one or more infinite values, the kurtosis is -INF, unless the values are constant at INF (or -INF), in which case it is NaN.
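The formula above can be transcribed directly into a short Python sketch (again illustrative, not Analytica's internal code), where σ is the weighted standard deviation of the same sample.

```python
import math

# Direct transcription of the kurtosis formula:
#   sum_i w_i ((x_i - xbar)/sigma)^4 / sum_i w_i  -  3
def weighted_kurtosis(x, w):
    sw = sum(w)
    xbar = sum(wi * xi for xi, wi in zip(x, w)) / sw
    sigma = math.sqrt(sum(wi * (xi - xbar) ** 2 for xi, wi in zip(x, w)) / sw)
    return sum(wi * ((xi - xbar) / sigma) ** 4 for xi, wi in zip(x, w)) / sw - 3
```

For a two-point distribution at ±1, every standardized deviation is exactly 1, so the kurtosis is 1 − 3 = −2, the most negative value possible.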

== GetFract(x, p, i, w, discrete) ==

See [[GetFract]].

== Probability(b; I: optional IndexType = Run; w) ==

See [[Probability]].

== Frequency(x, a, i, w) ==

Frequency returns a count or histogram of the number of occurrences of each value of index «a» in «x», with the result indexed by «a». It works whether «x» and «a» contain numeric or text values. If «a» contains numbers in ascending order, it returns the number of values in «x» that are equal to or less than each value of «a» and greater than the previous value of «a». If you don't specify index «i», Frequency evaluates «x» as a probability distribution and computes the frequency over index Run. Otherwise, you can specify a different index «i» of «x» over which to count how often each value of «a» occurs in «x».

If you specify a weight «w» for each value of Run (or «i»), it returns the weighted count. With the default value of 1 for the system variable SampleWeighting, Frequency returns the count of points, which is generally larger than 1. If you want the relative frequency of points in the sample, you can divide by Sum(SampleWeighting, Run). If you want the frequency relative to those values in «a», you can divide the result by the result summed over «a».
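The binning rule for ascending numeric «a» can be sketched in Python. This is an assumed reading of the rule stated above, not Analytica's implementation; in particular, the handling of values above the last bin is a guess in this sketch.

```python
from bisect import bisect_left

# Sketch of Frequency's counting rule for ascending numeric bins «a»:
# each x lands in the first bin value that is >= x, contributing its weight.
def frequency(x, a, w=None):
    w = w if w is not None else [1] * len(x)
    counts = [0.0] * len(a)
    for xi, wi in zip(x, w):
        k = bisect_left(a, xi)      # first index with a[k] >= xi
        if k < len(a):              # values above the last bin are dropped here
            counts[k] += wi
    return counts
```

Dividing the result by the total weight gives the relative frequency, mirroring division by Sum(SampleWeighting, Run).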

You can also use Frequency to efficiently aggregate an array from a detailed index to a less detailed index. For example, if Revenue is indexed by Month, and you wish to aggregate (by summing) to Year:

Frequency(X: MonthToYear, A: Year, I: Month, w: Revenue)

This is equivalent to:

Aggregate(Revenue, MonthToYear, Month, Year)

where MonthToYear is an array, indexed by Month, having the value of Year in each cell. An equivalent expression would be

Sum((MonthToYear = Year)*Revenue, Month)

but notice that this third method generates an intermediate value, MonthToYear = Year, that is indexed by Month and Year. It has a complexity of [math]\displaystyle{ O( |Month| \times |Year|) }[/math], while the Frequency method (and Aggregate) has a complexity of [math]\displaystyle{ O( |Month| ) }[/math]. Note: |Year| doesn't appear since the associative lookup uses an [math]\displaystyle{ O(1) }[/math] hash-table based lookup.
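The complexity contrast above can be sketched in Python with hypothetical month/year data (the names are illustrative, not from any Analytica model): the outer-product approach tests every (month, year) pair, while the hash-table approach makes a single pass over the months.

```python
# Hypothetical data standing in for MonthToYear and Revenue.
months        = ["Jan-99", "Feb-99", "Jan-00"]
month_to_year = {"Jan-99": 1999, "Feb-99": 1999, "Jan-00": 2000}
revenue       = {"Jan-99": 10, "Feb-99": 20, "Jan-00": 30}
years         = [1999, 2000]

# O(|Month| x |Year|): like Sum((MonthToYear = Year)*Revenue, Month)
outer = [sum(revenue[m] for m in months if month_to_year[m] == y)
         for y in years]

# O(|Month|): one hash-table pass, like Frequency(MonthToYear, Year, Month, w: Revenue)
by_year = dict.fromkeys(years, 0)
for m in months:
    by_year[month_to_year[m]] += revenue[m]
hashed = [by_year[y] for y in years]
```

Both produce the same yearly totals; only the amount of intermediate work differs.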

== Correlation ==

See [[Correlation]].

== Rank Correlation ==

See [[RankCorrel]].

== Pdf ==

See Cdf and Pdf Functions.

== Cdf ==

See Cdf and Pdf Functions.

== ProbBands(x, I, w, discrete) ==

Computes a weighted probability bands result. The result of this function appears on a probability bands result view.

The percentiles returned are selected from the Uncertainty Settings dialog. If the function call appears in the definition of variable Va1, then the uncertainty settings for Va1 are used if they have been set. If the ProbBands call occurs in a user-defined function, and that function is called from Va1, the default setting is used. If it is called from a button script, or if the variable whose definition contains the call does not have a local setting specified, the global default Uncertainty Settings are used.

The result is indexed by a local "magic" index named Probability. The number of elements in this index may vary, and may change as the user changes the Uncertainty Settings. For this reason, it is generally better to avoid this function and to use the GetFract function instead. In fact, ProbBands is almost identical to GetFract; the only difference is that you specify the desired fractiles when calling GetFract, while ProbBands uses the UI settings.

See the description of GetFract for more details about the distinction and treatment of discrete versus continuous samples.
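For a discrete sample, the weighted fractile that GetFract and ProbBands compute can be sketched as the smallest sample value whose cumulative weight fraction reaches «p». This is an assumed semantics for illustration; Analytica's exact treatment of continuous samples and interpolation may differ.

```python
# Sketch of a weighted fractile: sort the sample, accumulate weights, and
# return the first value whose cumulative weight fraction reaches p.
def weighted_fractile(x, w, p):
    pairs = sorted(zip(x, w))
    total = sum(w)
    cum = 0.0
    for xi, wi in pairs:
        cum += wi
        if cum / total >= p:
            return xi
    return pairs[-1][0]          # p == 1 falls through to the maximum
```

With equal weights this reduces to an ordinary sample percentile; unequal weights shift the fractile toward the heavily weighted points.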

== Statistics(x, I, w) ==

Computes a set of weighted statistics for «x». This is the result that appears in the Statistics view of a Result window.

The statistics included are selected from the Statistics tab of the Uncertainty Settings dialog.

If the call to Statistics appears in the definition of Variable Va1, then Va1's local uncertainty settings are used if they are set. In all other cases, the global default Uncertainty Settings are used to select the statistics.

Your expression should never assume that any particular statistic will be present, since changes to the uncertainty settings will change the result. It is generally better to use the individual statistics functions described elsewhere on this page in an expression.

The optional domain parameter is relevant only to the median statistic. See the description of GetFract for more details. It would be highly unusual to set that parameter explicitly.

The Min and Max statistics, if they appear, are computed using CondMin(x, I, w) and CondMax(X, I, w), so that any points having a zero weight are not included in the Min or Max.

== Min, Max, CondMin, CondMax ==

Min(x, I)
CondMin(x, I, b)

These functions are not statistical functions -- the first parameters are always evaluated in context mode, and the condition to CondMin and CondMax (which is the equivalent of Weight) does not default to SampleWeighting.

However, when Min and Max are shown on the Statistics view in the Result window, what is shown is the conditional min and max (i.e., CondMin, CondMax), using SampleWeighting > 0 as the condition. This means that points with zero weight are not included in the Min or Max.
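The zero-weight exclusion described above can be sketched in Python. This is an illustration of the behavior, not Analytica's CondMin/CondMax code: points whose weight is zero are filtered out before taking the extreme.

```python
# Sketch of the conditional min/max used for the Statistics view:
# points with zero weight are excluded before taking the extreme.
def cond_min(x, w):
    kept = [xi for xi, wi in zip(x, w) if wi > 0]
    return min(kept) if kept else None

def cond_max(x, w):
    kept = [xi for xi, wi in zip(x, w) if wi > 0]
    return max(kept) if kept else None
```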

= See also =
