Multivariate distributions

A multivariate distribution is a probability distribution over an array of quantities — or, equivalently, an array of distributions. Analytica’s Intelligent Array features make it relatively easy to generate multivariate distributions. There are three main ways:

  • To create an array of identical independent distributions, use the «Over» parameter.
  • To create an array of independent distributions with different parameters, pass array(s) of parameter values to the function.
  • To create an array of dependent distributions, with specified correlations or covariances, use a function from the Multivariate Distributions library.

The Over parameter to probability distributions

The simplest way to generate an array of identical, independent distributions is to use the Over parameter. For example:

Normal(10, 2, over: K)

generates an array of independent normal distributions, each with mean 10 and standard deviation 2, over index K.

All parametric distributions accept «Over» as an optional parameter. «Over» allows multiple indexes if you want to create a multidimensional array of identically distributed quantities. For example, this generates a three-dimensional array of independent, identically distributed uniform distributions:

Uniform(0, 10, over: I, J, K)

Probability distributions with array parameters

Probability distribution functions fully support Intelligent Arrays. If a parameter is an array, the function generates an array of independent distributions over any index(es) of the array. For example:

Index K := ['A', 'B', 'C']
Variable Xmean := Table(K)(10, 11, 12)
Variable X := Normal(Xmean, 2)

X is an array of normal distributions over index K, each with the corresponding mean. If you define a normal distribution with two parameters (mean and standard deviation) with the same Index(es) — in this case, Xmean and Ysd are both indexed by K:

Variable Ysd := Table(K)(2, 3, 4)
Variable Y := Normal(Xmean, Ysdeviation)

it generates an array of normal distributions over index K, each with corresponding mean and standard deviation. More generally, the result is an array with the union of the indexes of all its parameters — just the same as all other functions and operations that support Intelligent Arrays.

The custom probability distributions, including ProbTable, ProbDist, and CumDist, expect their parameters to be arrays of probabilities, probability densities, or values, with a common index. In this case, the common index is used in generating the random sample and does not appear in the result. But, if those array parameters have any other indexes, those indexes also appear in the result, following the usual rules of Intelligent Arrays.

Multivariate Distributions library

This library offers a variety of functions for generating probability distributions that are dependent or correlated. It is distributed with Analytica. To add this library to your model see Adding library to a model.

Many of these functions specify dependence among distributions using a rank correlation number or matrix, also known as the Spearman correlation. Unlike the Pearson or product-moment correlation, rank correlation is a non-parametric measure of correlation. It is equivalent to the Pearson correlation on the ranks. It does not assume that the relationship is linear, and applies to ordinal as well as interval-scale variables. It is therefore a more robust statistic. For example, it is a more stable way to estimate the relationship between two random samples when one or both has a long tail — such as a lognormal distribution. In such cases, Pearson correlation might be misleadingly large (or small) when an extreme sample in the tail of one sample does (or does not) correspond with an extreme value in the other sample.

The methods provided to generate general multivariate distributions with specified rank correlation, first generate multivariate normal (Gaussian) distributions with specified rank correlation, and then transform them to the desired marginal distributions. The rank correlations are not changed by such transformation.

The method for generating correlated distributions (based on Iman & Conover) works for median and random Latin Hypercube as well as simple Monte Carlo simulation methods. The rank-correlations of the results are approximately, but not exactly, equal to the specified rank-correlations. The accuracy of the approximation increases with the sample size.

Create one distribution dependent on another

Normal_correl(m, s, r, y)

Generates a normal distribution with mean m, standard deviation «s», and correlation «r» with uncertain quantity «y». In mid mode, it returns «m». If «y» is not normally distributed, the result is also not normal, and the correlation is approximate. It generalizes appropriately if any of the parameters are arrays. The result array has the union of the indexes of the parameters. See also Normal_correl().

Correlate_with(s, ref, rc)

Reorders the samples of «s» so that the result has the identical values to «s», and a rank correlation close to «rc» with the reference sample, «ref». See also Correlate_with().

Example: To generate a lognormal distribution with a 0.8 rank correlation with Z, use:

Correlate_with(LogNormal(2, 3), Z, 0.8)

Note: If you have a non-default SampleWeighting of points, the weighted rank correlation might differ from «rc».

Dist_reshape(x, newdist)

Reshapes the probability distribution of uncertain quantity «x», so that it has the same marginal probability distribution (i.e., same set of sample values) as «newdist», but retains the same ranks as «x» over Run. Thus:

Rank(Sample(x), Run) = Rank(Sample(Dist_reshape(x, y)), Run)

In a mid context, it simply returns Mid(newdist), with any indexes of «x».

The result retains any rank correlations that «x» might have with other predecessor variables. So, the rank-order correlation between a third variable «z» and «x» is the same as the rank-order correlation between «z» and a reshaped version of «x», like this:

RankCorrel(x, z) = RankCorrel(Dist_reshape(x, y), z)

The operation can optionally be applied along an index «r» other than Run. See also Dist_reshape().

An array of distributions with correlation or covariance matrix

Correlate_Dists(x, rcm, m, i, j)

Given an array «x» indexed by «i» of uncertain quantities, it reorders the samples so as to match the desired rank correlation matrix, «rcm» between the x[i] as closely as possible. «rcm» is indexed by «i» and «j», which must be the same length. It must be positive definite, and the diagonal should be all ones. The result has the same marginal distributions as «x[i]», and rank correlations close to those specified in «rcm». In mid mode, it returns Mid(x). See also Correlate_Dists().

Gaussian(m, cvm, i, j)

Generates a multivariate Gaussian (i.e., normal) distribution with mean vector, «m», and covariance matrix, «cvm». «m» is indexed by «i». «cvm» must be a symmetric and positive-definite matrix, indexed by «i» and «j», which must be the same length. It is similar to MultiNormal() except that it takes a covariance matrix instead of a rank correlation matrix. Note: This is a built-in function in Analytica 5.0 and later.

MultiNormal(m, s, cm, i, j)

Generates a multivariate normal (or Gaussian) distribution with mean «m», standard deviation «s», and correlation matrix «cm». «m» and «s» can be scalar or indexed by «i». «cm» must be a symmetric, positive-definite matrix, indexed by «i» and «j», which must be the same length. It is similar to Gaussian, except that it takes a correlation matrix instead of a covariance matrix. See also MultiNormal().

BiNormal(m, s, i, c)

A 2D Normal (or bivariate Gaussian) distribution with means «m», standard deviations «s» (>0) and correlation «c» between the two variables. The index «i» must have exactly two elements. «s» must be indexed by «i». See also BiNormal().

Other parametric multivariate distributions

Dirichlet(alpha, i)

A Dirichlet distribution with parameters «alpha» > 0 indexed by «i». Each sample of a Dirichlet distribution produces a random vector indexed by «i» whose elements sum to 1. It is commonly used to represent second order probability information.

The Dirichlet distribution has a density given by:

k*Product(x^(alpha - 1), i)

where «k» is a normalization factor equal to:

GammaFn(Sum(alpha, i))/Sum(GammaFn(alpha), i)

The «alpha» parameters can be interpreted as observation counts. The mean is given by the relative values of «alpha» (normalized to 1), but the variance narrows as the alphas get larger, just as your confidence in a distribution would narrow as you get more samples. See Product(), [[Sum]() and GammaFn().

The Dirichlet lends itself to easy Bayesian updating, if you have a prior of alpha = 0, and you have «n» observations.

Multinomial(n, theta, i)

Returns the multinomial distribution, a generalization of the binomial distribution to «n» possible outcomes. For example, if you were to roll a fair die «n» times, the outcome would be the number of times each of the six numbers appears. «theta» would be the probability of each outcome, where Sum(theta, i) = 1, and index «i» is the list of possible outcomes. If «theta» doesn’t sum to 1, it is normalized.

Each sample is a vector indexed by «i» indicating the number of times the corresponding outcome (die number) occurred during that sample point. Each sample has the property

Sum(result, I) = n

UniformSpherical(i, r)

Generates points uniformly on a sphere (or circle or hypersphere). Each sample generated is indexed by «i», so if «i» has three elements, the points lie on a sphere.

The mid value is a bit strange here since there isn’t really a median that lies on the sphere. Obviously the center of the sphere is the middle value, but that isn’t in the allowed range. So, it returns an arbitrary point on the sphere. See UniformSpherical().

MultiUniform(cm, i, j, lb, ub)

The multi-variate uniform distribution. Generates vector samples (indexed by «i») such that each component has a uniform marginal distribution, and each component has the pair-wise correlation matrix «cm», indexed by «i» and «j», which must have the same number of elements. «cm» needs to be symmetric and must obey a certain semidefinite condition, namely that the transformed matrix [2*sin(30*cov)] is positive semidefinite. (In most cases, this roughly the same as «cm» being positive semidefinite.) «lb» and «ub» can be used to specify upper and lower bounds, either for all components, or individually if these bounds are indexed by «i». If «lb» and «ub» are omitted, each component has marginal Uniform(0, 1). See also MultiUniform().

Note: «cm» is the true sample correlation, not rank correlation.

The transformation is based on:

  • Falk, M., “A simple approach to the generation of uniformly distributed random variables with prescribed correlations,” Comm. in Stats - Simulation and Computation 28: 785-791 (1999).

Arrays with serial correlation

These six functions each generate an array of distributions over an index «t» such that each distribution has a specified serial correlation with the preceding element over «t». They are especially useful for modeling dynamic processes or Markov processes over time, where the value at each time step depends probabilistically on the value at the preceding time. Normal_serial_correl() and Dist_serial_correl() generate arrays of serially correlated distributions that are normal and arbitrary, respectively. Normal_additive_gro() and Dist_additive_growth() produce arrays with uncertain additive growth with serial correlation. Normal_compound_gro() and Dist_compound_growth]() produce arrays with uncertain compound growth with serial correlation

Normal_serial_correl(m, s, r, t)

Generates an array of normal distributions over index «t» with mean «m», standard deviation «s», and serial correlation «r» between successive values over index «t». You can give each distribution a different mean and/or standard deviation if «m» and/or «s» are arrays indexed by «t». If «r» is indexed by «t», r[t = k] specifies the correlation between result[t = k] and result[t = k - 1]. (Then it ignores the first correlation, r[@t = 1].) See result, @ operator and Normal_serial_correl().

Dist_serial_correl(x, r, t)

Generates an array «y» over time index «t» where each y[t] has a marginal distribution identical to «x», and serial rank correlation of «rc» with y[t - 1]. If «x» is indexed by «t», each y[t] has the same marginal distribution as x[t], but with samples reordered to have the specified rank correlation «r» between successive values. If «r» is indexed by «t», r[@t = k] specifies the rank correlation between y[@t = k] and y[@t = k - 1]. Then the first correlation, r[@t = 1], is ignored. See Dist_serial_correl().

Normal_additive_gro(x, m, s, r, t)

Generates an array of values over index «t», with the first equal to «x», and successive values adding an uncertain growth, normally distributed with mean «m» and standard deviation «s». If we denote the result by «g», «r» specifies a serial correlation between g[@t = k] and g[@t = k - 1]. «x», «m», «s», and «r» each can be indexed by «t» if you want them to vary over «t». See Normal_additive_gro().

Dist_additive_growth(x, g, rc, t)

Generates an array of values over index «t», with the first equal to «x», and successive values adding an uncertain growth «g», and serial correlation «rc» between g[@t = k] and g[@t = k - 1]. «x», «g», and «rc» each can be indexed by «t» if you want them to vary over «t». See also Dist_additive_growth().

Normal_compound_gro(x, m, s, r, t)

Generates an array of values over index «t», with the first equal to «x», and successive values multiplied by compound growth 1 + g, where «g» is normally distributed with mean «m» and standard deviation «s». It applies serial correlation «r» between g[@t = k] and g[@t = k- 1]. «x», «g», and «rc» each can be indexed by «t» if you want them to vary over «t». See also Normal_compound_gro().

Dist_compound_growth(x, g, rc, t)

Generates an array of values over index «t», with the first equal to «x», and successive values multiplying by an uncertain compound growth «g», and serial rank correlation «rc» between g[@t = k] and g[@t = k - 1]. «x», «g», and «rc» each can be indexed by «t» if you want them to vary over «t». See also Dist_compound_growth().

Uncertainty over regression coefficients

For a description of RegressionDist(), RegressionNoise(), and RegressionFitProb(), see Uncertainty in regression results.

See Also


Comments


You are not allowed to post comments.