Keelin (MetaLog) distribution

(Redirected from CumKeelinFromA)
Release:

 • 4.6 •  5.0 •  5.1 •  5.2 •  5.3 •  5.4 •   •  6.0 •  6.1 •  6.2 •  6.3 •  6.4 •  6.5 •  6.6 •  7.0

New to Analytica 5.0

Keelin( values , percentiles, I, lb, ub, nTerms, flags, over)
KeelinFromA( a, K, lb, ub, nTerms, flags, over)

The Keelin distribution, also known as the "MetaLog" distribution, is a remarkably flexible distribution capable of taking on the shape of almost all common continuous distributions. It fits a distribution to specified fractile «values» with specified probability «percentiles» and optional lower and upper bound, «lb» and «ub». For the common special case, when you have three fractiles with symmetric probabilities -- for example 10, 50, and 90th percentiles where 10 and 90 are symmetrical around 50 -- it is more convenient to use the UncertainLMH() function. The Keelin distribution is also a convenient way to fit a probability distribution to a random sample of data, without having to worry about what parametric form is best.

The Keelin distribution is among the most versatile of all distributions, with an ability to produce unbounded, semi-bounded and bounded distributions, including uniform, beta, normal, gamma, with nearly any theoretically possible combination of skewness and kurtosis. In this respect, it is even more flexible that the family of Pearson distributions. Tom Keelin introduced the distribution in:

The Keelin MetaLog 2.0 distribution is introduced in

You can specify a Keelin distribution in three ways:

  • From a list of (value, percentile) pairs, where the «values» are sample values for the quantity and the «percentiles» are the percentile (aka quantile or fractile) probabilities between 0 and 1. Another way of saying this is that «values», «percentiles» are points on the cumulative probability curve.
  • From a set of representative points (a data sample) (by specifying «values» without «percentiles»).
  • From a coefficient vector (by using the KeelinFromA variant). These coefficients are typically obtained from data «values», or from (value, percentile) pairs using the function KeelinCoefficients. The coefficient vector may be a much shorter description of the distribution than the original data.

Parameters

  • «values»: This parameter may be either (1) a random sample of data points from a distribution, with «percentiles» omitted, or (2) a collection of fractile or percentile values (corresponding to the probability levels in «percentiles»). Legacy models might also pass a coefficient vector («a») in this parameter with the «flags»=1 bit set; however, starting with Analytica 6.6 separate Keelin*FromA functions are used for this case instead.
  • «percentiles»: (Optional): The percentile probabilities (also called quantile or fractile levels) for the values in «values», also indexed by «I». Each number must be between 0 and 1. For example, when a value in «percentiles» is 0.05, the corresponding value in «values» is the 5th percentile.
  • «I»: (Optional): The index of «values» and «percentile». You can omit this index when either «values» or «percentile» is itself an index.
  • «a»: A coefficient vector indexed by K. This parameter is used by KeelinFromA functions in place of «values», «percentiles» and «I». The «a» vector is typically obtained by calling KeelinCoefficients.
  • «K»: The basis index, which indexes «a».
  • «lb», «ub»: (Optional) Upper and lower bound. Set one or both of these to a single number if you know in advance that your quantity is bounded. When neither is specified, the distribution is unbounded -- i.e., with tails going to -INF and INF. When one is set the distribution is semi-bounded, and when both are set it is fully bounded.
  • «nTerms»: (Optional) The number of basis terms used for the fit. This should be 2 or greater. See #Number of terms below.
  • «flags»: (Optional) A bit-field, where any of the following flags can be added together.
    1: ( deprecated) «values» contains coefficients (as obtained from the KeelinCoefficients function). When not set, «values» contains sample values. As of Analytica 6.6, this flag is retired in favor of using the KeelinFromA function instead.
    8: Do not issue a warning when the Keelin distribution is infeasible. (See Infeasibility below). It does not test feasibility when «values» are Keelin coefficients -- i.e., when «flags»=1.
    16: Return a sample even when infeasible. When this bit is not set, Null is returned if infeasible. When this is set, a mid-value or sample is returned anyway.
    32: Disable tail constraints. By default, the Keelin function uses tail constraints to reduce the frequency of infeasible, an improvement to the algorithm published in the original Keelin paper. You can disable these to reproduce the original Keelin algorithm. See Infeasibility for details.
    64: (New to Analytica 6.5) Use the legacy basis ordering. As of Analytica 6.5, the MetaLog 2.0 ordering is used, but prior to that, the legacy ordering is use. Use this to force the basis ordering to match a work that assumes a legacy ordering. You can force it to default to the legacy ordering by setting RandomLegacyRelease to 60499 or earlier. {{Release|1=6.6|2=|3=
    128: (New to Analytica 6.6) Disable the feasibility guarantee (for legacy behavior).
  • «over»: (Optional) A list of indexes to sample independently across.

See also the analytic functions for the Keelin distribution (see Analytic distribution functions below):

CumKeelinInv( p, values, percentiles, I, lb, ub, nTerms, flags)
CumKeelin( x, values, percentiles, I, lb, ub, nTerms, flags)
DensKeelin( x,p, values, percentiles, I, lb, ub, nTerms, flags)
CumKeelinFromAInv( p, a, K, lb, ub, nTerms )
CumKeelinFromA( x, a, K, lb, ub, nTerms )
DensKeelinFromA( x, p, a, K, lb, ub, nTerms )

Functional form

The Keelin MetaLog distribution is a quantile-parameterized distribution, in which its inverse cumulative density function (aka quantile) takes the form:

[math]\displaystyle{ Q(p) = F^{-1}(p) = s(p) + \mu(p) logit(p) }[/math]

where [math]\displaystyle{ s(p) }[/math] and [math]\displaystyle{ \mu(p) }[/math] are both polynomials. When both are zero-degree polynomials (i.e., constants), it is a Logistic distribution.

We can write this as

[math]\displaystyle{ Q(p) = F^{-1}(p) = \sum_{k=1}^n B_k(p) }[/math]

where [math]\displaystyle{ B }[/math] is the basis and given by

[math]\displaystyle{ B_k(p) = \left\{ \begin{array}{ll} \ (p-0.5)^{\lfloor \frac{k-1}{2} \rfloor} & \mbox{when }k\mod 4\in \{0,1\} \\ (p-0.5)^{\lfloor \frac{k-1}{2} \rfloor} logit(p) & \mbox{when }k\mod 4\in\{2,3\} \end{array}\right. }[/math]

The original Keelin paper used a slightly different basis order from what is shown above, which we call the legacy basis order, given by:

[math]\displaystyle{ B_k(p) = \left\{ \begin{array}{ll} \ (p-0.5)^{\lfloor \frac{k-1}{2} \rfloor} & \mbox{when }k=1, k=4\mbox{ or }k\ge 5\mbox{ and $k$ is odd} \\ (p-0.5)^{\lfloor \frac{k-1}{2} \rfloor} logit(p) & \mbox{when }k=2, k=3\mbox{ or }k\ge6\mbox{ and $k$ is even} \end{array}\right. }[/math]

The newer (Metalog 2.0) basis is now used (as of Analytica 6.5) due to superior numerical properties proved in [Baucells et. al. 2005]. The ordering difference does not affect Keelin distributions with 6 or fewer terms. Above six terms, the same distribution results for an even number of terms, but the ordering of some of the columns are swapped. For an odd number of terms with 7 terms or more, a different distribution results.


Different parameterizations

When using a Keelin distribution, you start with a set of quantile assessments or a set of data points. You can either obtain your final result in a single call without ever obtaining the "a- coefficient vector", or you can first compute the a-coefficient vector and then pass that to the function of interest.

When using a single function call, pass your data to the «values» and optionally «percentiles» parameters of the Keelin function (or CumKeelin, DensKeelin or CumKeelinInv function if using an analytic distribution function). For example:

Chance Channel_depth ::= Keelin( depth_measurements, I: measurement_index, nTerms:8 )

Using two steps provides you with the coefficient vector and performs the fit only once. If you needed to evaluate two functions (like you needed both the density and the cumulative probability), then it is more efficient to compute the a-coefficient vector only once. To use this method, first call KeelinCoefficients:

Index K ::= 1..8
Variable a ::= KeelinCoefficients( depth_measurements, I: measurement_index, K: K)

Then pass the coefficient vector, a, to the corresponding *FromA function.

KeelinFromA( a, K )

Bounds treatments

When you want bounded support (i.e., a lower or upper bound, or both), the MetaLog offers two approaches for incorporating the bounds -- the direct method and the transform method.

Using direct method for bounds

Recall that the Keelin Metalog distribution's quantile function has the form:

[math]\displaystyle{ M(y) = \mu(y) + s(y) logit(y) }[/math]

When [math]\displaystyle{ s(0)=0 }[/math], it follows that [math]\displaystyle{ \lim_{y\rarr 0} M(y) = \mu(0) }[/math], resulting in a distribution with a lower bound of [math]\displaystyle{ \mu(0) }[/math]. Similarly, when [math]\displaystyle{ s(1)=0 }[/math] then the distribution has an upper bound of [math]\displaystyle{ \mu(1) }[/math]. The direct method incorporates either or both bounds by constraining the choice of coefficients.

Because each bound reduces the degrees of freedom by 2, you need to increase the number of terms by 2 for a semi-bounded cases, or by 4 for a fully bounded case, to attain the comparable level of shape flexibility.

One advantage of using the direct method for bounds is that all known properties and closed-form statistics continue to hold (see Baucells et. al. (2025), "On the properties of the MetaLog distribution"). This includes the properties available from the KeelinPropertyFromA functions, which includes closed-form expressions for moments and partial expectation.

A downside of the direct method is that the incorporation of these equality constraints comes with extreme numeric sensitivity. Even a small deviation from [math]\displaystyle{ s(0)=0 }[/math], like even [math]\displaystyle{ s(0)=1e-12 }[/math], results in an infinite tail again. In addition, the transform method for bounds is slightly more flexible in the sense that it expands the possible combinations of skewness and kurtosis that are attainable.

To use the direct method for bounds, you need to specify at least one of the «lb» and «ub» parameters and «flags»=512 when using any Keelin distribution function that does not have "FromA" in its name. Once you obtain the «a» coeffcients from KeelinCoefficients(...), do not repeat the «lb» or «ub» to any Keelin*FromA function.

Using the Transform method for bounds

The second method for incorporating bounded or semi-bounded support uses a Logit- or Log- transform. This is the method that is described in the original MetaLog paper]. For backwards compatibility reasons, this is the default method if you don't specify the «flags»=512 bit.

When [math]\displaystyle{ \log(x) }[/math] has an unbounded distribution, then [math]\displaystyle{ x }[/math] has a semi-bounded distribution with a lower-bound of 0. Similarly, when [math]\displaystyle{ logit(x) }[/math] has an unbounded distribution, the distribution for x has bounded support on (0,1). The transform method uses these facts. It would be natural to call these transformed distributions the log-MetaLog (or log-Keelin) distribution and logit-MetaLog or logit-Keelin distribution, following the convention used by other distributions (e.g., LogNormal distribution, Log-Uniform distribution, Log-Logistic distribution, Logit-normal distribution etc.), but this convention is not used in the case of the Keelin distribution. Tom Keelin's original MetaLog paper defined the "MetaLog distributions" to be a term referring to the collection of all these transform variants. Hence, we use the term untransformed Keelin distribution to mean the base-distribution having the quantile function [math]\displaystyle{ M(y) }[/math]. (We no longer use the term unbounded Keelin distribution for this variant since, as seen in the previous subsection, the untransformed variant can be unbounded, semi-bounded or fully-bounded.

To use a transform method for bounds, pass «lb» and «ub» to every function in the Keelin distribution family of functions, including to any Keelin*FromA functions. You do not need to specify any extra «flag».

Specifying a Mean

In addition to specifying the «values» and (optionally) «percentiles», you can also specify the mean value for the distribution. For example:

Keelin( data, ,I, K, mean:10 )

This will fit the data with an extra constraint that the mean of the resulting distribution will be 10. When you specify the mean, it reduces the degrees of freedom by 1, so that you should increase the number of terms by 1 in order to maintain a comparable shape flexibility.

There are a few caveats to the use of «mean» when you use bounded support, i.e., you specify either «lb», «ub» or both parameters.

When using the direct method of bounds (i.e., with the «flags»=512 bit), you can attain a specified «mean» for either semi-bounded case (provided the specified «mean» is on the correct side of the bound), but in general it is not possible to satisfy both bounds and the mean simultaneously in all cases. If the «mean» gets too close to either «lb» or «ub» then no feasible solution will exist at a given number of terms. The more terms you use, the closer the «mean» can be to «lb» or «ub», but eventually numeric precision prevents extreme proximity at any number. If the «mean» is within the middle third of the bounded support, then it is possible to achieve the mean at any number of terms.

When you are using a transform method for bounds, the specified «mean» will not be exactly satisfied. Instead, the transformed «mean» will be exactly satisfied in the transformed, unbounded underlying distribution, but only approximate in the bounded or semi-bounded distribution.

Examples

Fit to data

Suppose you've collected data on the weights of fish caught last year in the Columbia river, and now you want to fit a distribution to these measurements. Since you know that a fish's weight cannot be negative, you'll use a semi-bounded distribution. Suppose the data is in a variable named Fish_weight which in indexed by Fish_ID. Use

Keelin( Fish_weight, I:Fish_ID, lb: 0)

Using fractiles

You find a published table stating the 500-year, 100-year, 10-year and median rain fall levels for a town of interest (where the 500-year level is a level so big that it is experienced only once every 500 years).

Index Fractile := [ 1/2, 1/10, 1/100, 1/500 ]
Variable Rainfall_level := Table(Fractile)(5, 12, 25, 60)
Chance Rainfall := Keelin( Rainfall_level, Fractile, lb:0 )

The resulting CDF is plotted here on a log-X scale:

Keelin rainfall.png

Infeasibility

Some a-coefficient vectors are infeasible. If you were to use such a coefficient vector, the probability density would appear negative in some places, your cumulative probability function would reverse (have negative slope) in places, and the tails may even reverse, with the left tail going right or the right tail going left!

Until recently, infeasibility often complicated the use of the Metalog distribution. The original Keelin (2016) algorithm would return infeasible (and thus unusable) «a» vectors in many cases.

Metalog 2.0, introduced in [Baucells, Chrisman, Keelin and Xu (2025), "On the properties of the MetaLog distribution"] fixed this problem. Starting in Analytica 6.6, only feasible coefficient vectors are used. The algorithm now finds the best fit among all feasible functions have the Metalog functional form (rather than among all functions having that functional form). Specifically, it solves for

[math]\displaystyle{ \mbox{minimize}_\vec{a} \sum_i ( \hat{x}_i - M(\hat{y}_i | \vec{a}) )^2 }[/math]
such that [math]\displaystyle{ M'(y) \ge 0 }[/math] for all [math]\displaystyle{ 0\lt y\lt 1 }[/math], where [math]\displaystyle{ (\hat{x},\hat{y}) }[/math] are the data points on the CDF.

In the event you want to match the infeasible behavior of Metalog 1.0, you can pass flags:128 to any of the Keelin functions that accept a «values» parameter. If you are still using Analytica 6.5 or earlier, then you need to worry about infeasibility.


Number of terms

The optional «nTerms» parameter varies the number of basis terms used for the fit. A larger number of terms results in a more detailed fit, but may also overfit when the data has randomness. Varying «nTerms» may also impact whether the data is or is not infeasible.

With 2 terms, the smallest that should be considered, the distribution is limited to a Logistic distribution (or Log-Logistic or Logit-Logistic when «lb» or «ub» are set), which gives it enough flexibility to match mean and standard deviation, but not skewness or kurtosis. With 3 terms skewness can be adjusted, but not kurtosis. With 4 terms, the median, variance, skewness and kurtosis can all be adjusted. In most cases, increasing «nTerms» enables it to fit your target distribution more closely.

You may find it useful to create a panel of fit distributions by varying «nTerms», making it possible to see what detail is revealed by the addition of terms, and also where the addition of more terms doesn't add useful detail. In many cases I've observed that there is an improved "fit" up to a point, followed by a plateau with very little change as «nTerms» increases, eventually followed by obvious over-fitting where it starts capturing the random spacing of samples. Often the plateau lasts for a long time. In these cases, it would make sense to set «nTerms» to a value on the plateau.

To create a "panel", first create an index: Index NumTerms := 2..50

Then use your data (say x indexed by I) to explore these: Variable fit_x := Keelin( x, ,I, nTerms:NumTerms)

In the following experiment, a data set with 100 measurements (not from any know distribution) was fit. A histogram of the data itself is shown here:

Keelin histo of data.png

Here are four "fit" Keelin distributions as «nTerms» was varies from 2 to 20:

Keelin nTerms varies.png

At 10 terms, a bi-modal effect starts to appear, which may actually be there in the data, and which is not visible below 10 terms, However, at 20 terms there appears to be more variation than is probably warranted, which we might interpret as the onset of overfitting.

Analytic distribution functions

The Keelin function is a distribution function, meaning that it returns the median when evaluated in Mid-mode, a sample for Monte Carlo analysis when evaluated in Sample-mode, or a random variate when it is used in Random(). There are also analytic distribution functions for Keelin described in this section. The analytic distribution functions compute the exact value for a given Keelin without sampling error. These use the same parameters as Keelin( ), but also include a point of percentile where the analytic function is to be evaluated at, and this parameter comes first. It is common to pass an array of values in for the point or percentile parameter.

There are two forms of each function, one accepting «values», the other accepting «a», the coefficient vector which you would have previously "fit" to your data using KeelinCoefficients.

CumKeelinInv( p, values, percentiles, I, lb, ub, nTerms, flags)
CumKeelinInvFromA( p, a, K )

The inverse cumulative distribution function, also called the quantile function. Given a percentile (quantile) level «p», returns the value x such that there is a «p» probability that the actual value (or a value sampled at random from the distribution) is less than or equal to x.

Given a data set xi indexed by I, the 10-25-50-75-90 percentiles are obtained using:

Index percent=[10%, 25%, 50%, 75%, 90%] Do CumKeelinInv( percent, xi, , I)

Passing a Uniform(0,1) for «p» is equivalent to the Keelin distribution function:

CumKeelinInv( Uniform(0,1), xi )

CumKeelin( x, values, percentiles, I, lb, ub, nTerms, flags)
CumKeelinFromA(x, a, K)

The cumulative distribution function. For a given value «x», returns the probability that the true value (or a value sampled at random from the distribution) is less than or equal to «x».

Index quatile := [10%, 25%, 50%, 75%, 90%]
Index estimate := Table(quantile)( 20, 40, 70, 100, 130 )

This returns the probability that the quantity is less than 50:

CumKeelin( 50, estimate, quantile, quantile) → 0.33

DensKeelin( x,p, values, percentiles, I, lb, ub, nTerms, flags)
DensKeelinFromA( x,p, a, K )

The probability density function. Either «x» or «p», but not both, must be specified. When «x» is specified, returns the probability density at «x». When «p» is specified, returns the probability density at the «p»th quantile.

Using quantile and estimate shown in the example for CumKeelin above,

DensKeelin( 50, values:estimate, percentiles:quantile, I:quantile ) → 8.16m
DensKeelin( p:33%, values:estimate, percentiles:quantile, I:quantile ) → 8.16m

A good way to graph the probability density function is to create an X-Y plot of DensKeelin versus CumKeelinInv. This is illustrated here, again using the quantile and estimate example from above.

In Analytica 6.6 or later:

Index p := Sequence(0.1%, 99.9%, 0.1%);
Index K := 1..5;
Local a := KeelinCoefficients( estimate, quantile, quantile, K );
Local x := CumKeelinInvFromA( p, a, K);
Local density :=  [[DensKeelinFromA( p, a, K);
Index AxisLabel := ['X', 'Probability Density'];
Array( AxisLabel, [x,density] )

In Analytica 6.5 or earlier:

Index p := Sequence(0.1%, 99.9%, 0.1%);
Index K := 1..5;
Local a := KeelinCoefficients( estimate, quantile, quantile, K );
Local x := CumKeelinInv( p, a, I:K, flags:1 );
Local density :=  DensKeelin( p:p, values:a, I:K, flags:1 );
Index AxisLabel := ['X', 'Probability Density'];
Array( AxisLabel, [x,density] )

The Analytica 6.5 version also works in Analytica 6.6 but is less clear.

You need to plot this as an X-Y chart by pressing the [XY] button in the result window and enabling Use a comparison index. The comparison index here will be .AxisLabel.

KeelinDensityPlot.png

In the graph, an efficiency is obtained by calling KeelinCoefficients so that the data fit is performed only once, so that it doesn't have to be repeated in both successive calls to CumKeelinInv and DensKeelin. It evaluates both at 999 points (the number of points in p. Zero and 1 are not included in p since these would be -Inf and Inf in a distribution with tails like this. A nice thing about this approach is that it is easy to vary over the full range of p without having to know the actual range of the quantity itself.

KeelinPropertyFromA( a, K, property )

You can use the KeelinPropertyFromA function to compute statistical moments or partial expectation, find modes, test for feasibility, and several other properties using analytic (closed-form) methods once you have an «a» vector. See KeelinPropertyFromA for details.

See Also

Comments


You are not allowed to post comments.