KeelinCoefficients
KeelinCoefficients( values, percentiles, I, K, lb, ub, nTerms, flags )
This fits a Keelin distribution, also known as a MetaLog distribution, to data, or to a set of (x, p) pairs of values and percentile levels, and returns a vector of coefficients indexed by «K». The vector of coefficients can be a much shorter description of the distribution than the data itself. This vector of coefficients can then be passed to the functions Keelin, DensKeelin, CumKeelin and CumKeelinInv, reducing the computation time required by those functions.
The Keelin distribution is a versatile continuous distribution that can assume the shape of almost any standard unbounded, semi-bounded or bounded continuous distribution. If you have univariate continuous data and don't know what distribution to use to model that data, with no reason to believe from first principles that the data needs to be of a particular distribution class, then the Keelin distribution is likely to be a good choice. There is no need to figure out whether your data best matches a LogNormal, Gamma, Beta or some other distribution type. If the data does happen to match one of those closely, the Keelin will usually find the same shape. It can also represent virtually the entire space of Skewness/Kurtosis combinations, and can sometimes even discover meaningful multimodal distinctions.
The Keelin distribution is introduced in the paper:
- Thomas W. Keelin (Nov. 2016), "The Metalog Distributions", Decision Analysis, 13(4):243-277.
Parameters
- «values»: This can be either: (1) A representative sample of data points, with «percentiles» omitted, or (2) a collection of fractile estimates (corresponding to the quantile levels in «percentiles»).
- «percentiles»: (Optional) The percentile levels (also called quantile or fractile levels) for the values in «values», also indexed by «I». Each number must be between 0 and 1. For example, when a value in «percentiles» is 0.05, the corresponding value in «values» is the 5th percentile.
- «I»: (Optional) The index of «values» and «percentiles». This can be omitted when either «values» or «percentiles» is itself an index.
- «lb», «ub»: (Optional) Lower and upper bound. Set one or both of these to a single number if you know in advance that your quantity is bounded. When neither is specified, the distribution is unbounded (i.e., with tails going to -INF and INF). When one is set the distribution is semi-bounded, and when both are set it is fully bounded.
- «nTerms»: (Optional) The number of basis terms used for the fit. This should be 2 or greater. See Keelin#Number of terms below.
- «flags»: (Optional) A bit-field, where any of the following flags can be added together.
- 1 = «values» contains coefficients (as obtained from the KeelinCoefficients function). When not set, «values» contains sample values.
- 2 = Return the basis (see Returning the basis).
- 4 = Return the derivative of the basis.
- 8 = Do not issue a warning when infeasible. (See Infeasibility). No validation of feasibility is performed when coefficients are passed in (i.e., when «flags»=1 is set).
- 16 = Return the coefficients even when infeasible. When this bit is not set, Null is returned if infeasible. When this is set, a mid-value or sample is returned anyway.
- 32 = Disable tail constraints. The use of tail constraints is an improvement to the algorithm published in the original Keelin (2016) paper that reduces the frequency of infeasible fits to data. You can disable these to reproduce the original Keelin algorithm. See Infeasibility for more details.
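Because «flags» is a bit-field, the values listed above combine by simple addition, and each bit can be tested independently. The following Python sketch only illustrates that arithmetic. The flag names are hypothetical labels invented here for readability; they are not identifiers defined by Analytica.

```python
from enum import IntFlag


class KeelinFlags(IntFlag):
    """Hypothetical names for the documented bit values (not Analytica identifiers)."""
    VALUES_ARE_COEFFICIENTS = 1
    RETURN_BASIS = 2
    RETURN_BASIS_DERIVATIVE = 4
    NO_INFEASIBLE_WARNING = 8
    RETURN_EVEN_IF_INFEASIBLE = 16
    DISABLE_TAIL_CONSTRAINTS = 32


def describe(flags: int) -> list:
    """List the names of the bits that are set in a combined flags value."""
    return [f.name for f in KeelinFlags if flags & f]


# Suppress the infeasibility warning and still return coefficients: 8 + 16.
combined = KeelinFlags.NO_INFEASIBLE_WARNING + KeelinFlags.RETURN_EVEN_IF_INFEASIBLE
```

Here `combined` is the number you would pass as «flags» to request both behaviors at once, and `describe(combined)` recovers the two bits that were added together.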
To use
To use this function, you should create two indexes to pass to «I» and «K». Your «I» index indexes your data points. The «K» index will be used for the result, and typically its length determines the number of basis terms used (see Number of terms). The result is indexed by «K». If you omit «K», the function will create a local index named .K.
In some cases you may want to create a "panel" of distributions, where you fit the same data but vary the number of basis terms. Since you will likely want these in a single array, they should share the same «K» index, even though the number of terms varies. In this case, make your «K» index long enough for the largest basis, and then pass the «nTerms» parameter explicitly (usually you will pass it a vector, varying «nTerms» across yet another index). For example, assuming you have a vector estimate of estimates for a series of percentile levels given in percentile, with both estimate and percentile indexed by I:
- Index NumTerms := [5, 10, 15, 20]
- Index K := 1..20
- Variable Coef := KeelinCoefficients( estimate, percentile, I, K, nTerms:NumTerms )
In this case, the result is null-padded where NumTerms is less than IndexLength(K).
Your distribution data will be in one of two forms:
- A representative sample of points for your quantity, «values». In this case, omit the «percentiles» parameter.
or
- A set of ( «values», «percentiles» ) pairs. This is also equivalent to specifying points on the Cumulative Probability curve.
The first case is equivalent to the second case when the percentile levels are evenly spaced.
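To make the equivalence between the two forms concrete, the Python sketch below turns a raw sample into (value, percentile) pairs by sorting it and assigning evenly spaced percentile levels. The midpoint rule `(i + 0.5) / n` is an assumption chosen for illustration; the text does not say which spacing convention KeelinCoefficients uses internally, and `sample_to_pairs` is a hypothetical helper name.

```python
def sample_to_pairs(sample):
    """Pair each sorted sample value with an evenly spaced percentile level.

    Assumption: midpoint plotting positions (i + 0.5) / n, for illustration only.
    """
    xs = sorted(sample)
    n = len(xs)
    return [(x, (i + 0.5) / n) for i, x in enumerate(xs)]


pairs = sample_to_pairs([3.0, 1.0, 2.0, 4.0])
```

Each pair is one point on the empirical cumulative probability curve, which is the same kind of input you supply explicitly in the second form.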
The result
The result of the function is a coefficient vector indexed by «K». This vector can then be passed directly to any of the Keelin-distribution functions, namely:
- Keelin -- the distribution function (computes Mid- or Sample- values for uncertain quantities)
- DensKeelin, CumKeelin or CumKeelinInv -- the Analytic distribution functions.
In all these cases, the vector returned from KeelinCoefficients is passed as the «xi» parameter, and your «K» index must be passed as the index parameter «I» of these functions. Also, you must pass the 1 bit to the «flags» parameter. All of these functions name their data parameter «values», but they also accept «ai» as an alias for that parameter, so you can pass the coefficients with the named-parameter convention, using ai: to emphasize that these are coefficients, like this:
Keelin( ai: a, I:K, flags:1 )
We don't recommend spending much time trying to interpret the coefficients. The first coefficient will always be the median of the distribution, but the others are less obvious. The second coefficient tends to track Variance, the third tends to track Skewness and the fourth tends to track Kurtosis. They are not, however, these actual moments. It is possible to compute all moments of the distribution directly from these coefficients; see the Keelin (2016) reference cited above.
Bounds
When your quantity is unbounded, its distribution will have tails in both directions. In this case, you should omit the «lb» and «ub» parameters. If you know your quantity is bounded from below, then specify «lb», and if you know that your quantity is bounded from above, specify «ub». The distribution supports all combinations of unbounded, bounded and semi-bounded distributions in this way.
When you compute the coefficients with a particular combination of «lb» and «ub», you must specify the same «lb» and «ub» parameters when passing these coefficients to Keelin, DensKeelin, CumKeelin, or CumKeelinInv.
Returning the basis
Unless you are doing research on the Keelin distribution itself, you probably won't have a reason to access the basis. But if you need it, you can use this function to return the "basis" for the distribution. This is a 2-D matrix indexed by «I» and «K». It is a function of «percentiles» but does not depend on «values». For example:
- Index I := 1..9
- Variable Percentile := Table(I)(0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99, 0.999)
- Variable Estimate := Table(I)( 7.5, 56, 300, 650, 1300, 2300, 3700, 7500, 11K )
- Index K := 1..6 { 6 basis terms }
- Variable coeffs := KeelinCoefficients( Estimate, Percentile, I, K )
- Variable sampBasis := KeelinCoefficients( Sample(Uniform(0,1)), Run, K, flags:2 )
For an unbounded Keelin MetaLog, the values can be obtained from the basis and coefficients using
- Variable Samp := Sum( coeffs * sampBasis, K )
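For readers who want to see what such a basis looks like, here is a minimal Python sketch of the unbounded case. It assumes the term ordering published in Keelin (2016); the basis that KeelinCoefficients actually returns may differ in ordering or scaling, and the function names `metalog_basis` and `unbounded_quantile` are hypothetical.

```python
import math


def metalog_basis(y, n_terms):
    """One basis row at probability level y (0 < y < 1).

    Assumption: term ordering as in Keelin (2016), not verified against Analytica.
    """
    if not 0.0 < y < 1.0:
        raise ValueError("y must be strictly between 0 and 1")
    logit = math.log(y / (1.0 - y))
    d = y - 0.5
    row = []
    for k in range(1, n_terms + 1):
        if k == 1:
            term = 1.0
        elif k == 2:
            term = logit
        elif k == 3:
            term = d * logit
        elif k == 4:
            term = d
        elif k % 2 == 1:
            term = d ** ((k - 1) // 2)          # odd k >= 5: pure power term
        else:
            term = d ** (k // 2 - 1) * logit    # even k >= 6: power times logit
        row.append(term)
    return row


def unbounded_quantile(coeffs, y):
    """Sum of coefficient times basis term, the analogue of Sum(coeffs * sampBasis, K)."""
    basis = metalog_basis(y, len(coeffs))
    return sum(a * g for a, g in zip(coeffs, basis))
```

Under this assumed basis, every term except the first vanishes at y = 0.5, which is consistent with the first coefficient being the median.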
For the semi-bounded case with a lower bound, Ln(x-lb) is equal to Sum( coeffs * sampBasis, K ), hence you would use
- Variable Samp := lb + Exp( Sum( coeffs * sampBasis, K ) )
For the upper-bounded case, Ln(ub-x) is equal to Sum( coeffs * sampBasis, K ), hence you would use
- Variable Samp := ub - Exp( Sum( coeffs * sampBasis, K ) )
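And for the fully-bounded case, Ln( (x-lb) / (ub-x) ) is equal to Sum( coeffs * sampBasis, K ), hence use
- Variable Samp := ( lb + ub * Exp( Sum( coeffs * sampBasis, K ) ) ) / ( 1 + Exp( Sum( coeffs * sampBasis, K ) ) )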
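The relations stated above between x and the sum over the basis can each be solved for x. The Python sketch below does that algebra for all four bound combinations, taking the relations exactly as written in this section. The name `from_basis_sum` is hypothetical, and `s` stands for the value of Sum( coeffs * sampBasis, K ).

```python
import math


def from_basis_sum(s, lb=None, ub=None):
    """Recover x from s under the relations stated in the text.

    unbounded:      x = s
    lower-bounded:  Ln(x - lb) = s
    upper-bounded:  Ln(ub - x) = s
    fully bounded:  Ln((x - lb) / (ub - x)) = s
    """
    if lb is None and ub is None:
        return s
    if ub is None:
        return lb + math.exp(s)
    if lb is None:
        return ub - math.exp(s)
    e = math.exp(s)
    return (lb + ub * e) / (1.0 + e)
```

For example, `from_basis_sum(0.0, lb=0.0, ub=10.0)` lands on the midpoint of the interval, since a sum of zero corresponds to (x-lb) = (ub-x).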