Difference between revisions of "Keelin (MetaLog) distribution"

Revision as of 19:10, 31 March 2017

Keelin( xi, pi, I, lb, ub, nTerms, flags, over)

The Keelin distribution, also known as the Keelin MetaLog distribution, is a smooth, continuous distribution that can be specified in one of three ways:

From a set of representative points (a data sample).
From a list of (xi,pi) pairs, where the «xi» are fractile values and the «pi» are fractile levels. Another way of saying this is that «xi», «pi» are points on the cumulative probability curve.
From a coefficient vector. These coefficients are typically obtained from data «xi», or from (xi,pi) pairs, using the function KeelinCoefficients. The coefficient vector may be a much shorter description of the distribution than the original data. When passing coefficients, you must specify the «flags»=1 bit, and the «I» index indexes the basis terms rather than the data.
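To make the coefficient-vector form concrete, the following Python sketch evaluates the unbounded metalog quantile function from a list of coefficients, using the basis terms defined in Keelin's paper. It is not Analytica code: the names `metalog_basis` and `metalog_quantile` are hypothetical, and it is an assumption (not confirmed by this page) that the coefficients returned by KeelinCoefficients follow the same ordering.

```python
import math

def metalog_basis(y, n_terms):
    """Basis values of the unbounded metalog at cumulative probability y.

    Term order follows Keelin (2016); whether Analytica orders its
    coefficients the same way is an assumption.
    """
    if not 0.0 < y < 1.0:
        raise ValueError("y must lie strictly between 0 and 1")
    logit = math.log(y / (1.0 - y))
    d = y - 0.5
    basis = []
    for k in range(1, n_terms + 1):
        if k == 1:
            basis.append(1.0)
        elif k == 2:
            basis.append(logit)
        elif k == 3:
            basis.append(d * logit)
        elif k == 4:
            basis.append(d)
        elif k % 2 == 1:
            basis.append(d ** ((k - 1) // 2))        # odd k >= 5
        else:
            basis.append(d ** (k // 2 - 1) * logit)  # even k >= 6
    return basis

def metalog_quantile(coeffs, y):
    """Value x such that P(X <= x) = y, for an unbounded metalog."""
    basis = metalog_basis(y, len(coeffs))
    return sum(a * g for a, g in zip(coeffs, basis))
```

Because the quantile is linear in the coefficients, a handful of numbers is enough to reproduce the whole distribution, which is why the coefficient vector can stand in for the original data.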

The Keelin distribution is introduced in the paper:

Thomas W. Keelin (Nov. 2016), "The Metalog Distribution", Decision Analysis, 13(4):243-277

The Keelin distribution is a highly flexible distribution that is capable of taking on almost all common distributional shapes. It is among the most versatile of all distributions, with an ability to produce unbounded, semi-bounded and bounded distributions with nearly any theoretically possible combination of skewness and kurtosis. In this respect, it is even more flexible than the family of Pearson distributions.

If you have a data sample that is representative of your quantity, and you wonder which distribution you should fit to your data, the Keelin is a good option. Instead of worrying about finding which parametric form you need, the Keelin distribution usually adapts to the data quite nicely.

If you need a Keelin distribution based on 3 symmetric fractiles, such as based on 10-50-90 percentile estimates, use the UncertainLMH() function. UncertainLMH() is a more-convenient special case of Keelin() for that purpose.

Parameters

«xi»: This can be either: (1) A representative sample of data points, with «pi» omitted, (2) a collection of fractile estimates (corresponding to the fractile levels in «pi»), or (3) a Keelin coefficient vector with the «flags»=1 bit set. In all cases, «xi» must be indexed by «I».
«pi»: (Optional): The fractile levels for the values in «xi», also indexed by «I». For example, when «pi» is 0.05, the corresponding value in «xi» is the 5th percentile.
«I»: (Optional): The index of «xi» and «pi». This can be omitted when either «xi» or «pi» is itself an index.
«lb», «ub»: (Optional) Lower and upper bounds. Set these if you know in advance that your quantity is bounded. When neither is specified, the distribution is unbounded (i.e., with tails going to -INF and INF). When one is set the distribution is semi-bounded, and when both are set it is fully bounded.
«nTerms»: (Optional) The number of basis terms used for the fit. This should be 2 or greater. See #Number of terms below.
«flags»: (Optional) Specify flags:1 when «xi» contains coefficients (as obtained from the KeelinCoefficients function).
«over»: (Optional) A list of indexes to sample independently across.
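The effect of «lb» and «ub» can be sketched in Python using the transformations from Keelin's paper: a log transform for a semi-bounded distribution and a logit transform for a fully bounded one. The function name `apply_bounds` is hypothetical, and it is an assumption that Analytica applies the bounds in exactly this way.

```python
import math

def apply_bounds(m, lb=None, ub=None):
    """Map an unbounded metalog quantile value m onto the requested support.

    Transformations follow Keelin (2016); this is a sketch, not
    Analytica's implementation.
    """
    if lb is None and ub is None:
        return m                          # unbounded: tails go to -INF and INF
    if ub is None:
        return lb + math.exp(m)           # lower bound only (log metalog)
    if lb is None:
        return ub - math.exp(-m)          # upper bound only
    # Both bounds (logit metalog); the logistic is computed in a stable form.
    if m >= 0.0:
        s = 1.0 / (1.0 + math.exp(-m))
    else:
        e = math.exp(m)
        s = e / (1.0 + e)
    return lb + (ub - lb) * s

# An unbounded quantile value of 0 lands at different places per support.
print(apply_bounds(0.0))                  # unbounded
print(apply_bounds(0.0, lb=0.0))          # semi-bounded below
print(apply_bounds(0.0, lb=2.0, ub=6.0))  # fully bounded
```

Since each transformation is strictly increasing, fractile levels are preserved: the 5th percentile of the unbounded quantity maps to the 5th percentile of the bounded one.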

Examples

Fit to data

Suppose you've collected data on the weights of fish caught last year in the Columbia River, and now you want to fit a distribution to these measurements. Since you know that a fish's weight cannot be negative, you'll use a semi-bounded distribution. Suppose the data is in a variable named Fish_weight which is indexed by Fish_ID. Use

Keelin( Fish_weight, I:Fish_ID, lb: 0)

Using fractiles

You find a published table stating the 500-year, 100-year, 10-year and median rainfall levels for a town of interest (where the 500-year level is a level so big that it is experienced only once every 500 years).

Index Fractile := [ 1/2, 1 - 1/10, 1 - 1/100, 1 - 1/500 ]

Variable Rainfall_level := Table(Fractile)(5, 12, 25, 60)

Chance Rainfall := Keelin( Rainfall_level, Fractile, lb:0 )
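As a rough illustration of what a fit to fractiles involves, the Python sketch below fits a 4-term metalog with lower bound 0 to the four rainfall values by solving a small linear system. It assumes the return periods are converted to cumulative levels 0.5, 0.9, 0.99 and 0.998, and that as many terms as points are used so the fit passes exactly through each point. Both are assumptions, as are all function names; the number of terms Analytica uses by default is not stated here.

```python
import math

def basis4(y):
    """First four metalog basis terms at cumulative probability y."""
    logit = math.log(y / (1.0 - y))
    d = y - 0.5
    return [1.0, logit, d * logit, d]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_fractiles(xs, ps, lb):
    """Coefficients of a lower-bounded metalog through the (x, p) points."""
    z = [math.log(x - lb) for x in xs]      # log transform for the bound
    return solve([basis4(p) for p in ps], z)

def quantile(coeffs, y, lb):
    """Quantile of the fitted lower-bounded distribution at level y."""
    m = sum(a * g for a, g in zip(coeffs, basis4(y)))
    return lb + math.exp(m)

levels = [0.5, 0.9, 0.99, 0.998]            # assumed cumulative levels
rain = [5.0, 12.0, 25.0, 60.0]
coeffs = fit_fractiles(rain, levels, lb=0.0)
for p in levels:
    print(p, quantile(coeffs, p, 0.0))
```

This sketch does not check that the fitted quantile function is increasing everywhere, which a valid distribution requires.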

The resulting CDF is plotted here on a log-X scale:

Number of terms

The optional «nTerms» parameter varies the number of basis terms used for the fit. A larger number of terms results in a more detailed fit, but may also overfit when the data has randomness.

With 2 terms, the smallest number that should be considered, the distribution is limited to a Logistic distribution (or Log-Logistic when one of «lb» or «ub» is set, or Logit-Logistic when both are set), which gives it enough flexibility to match mean and standard deviation, but not skewness or kurtosis. With 3 terms skewness can be adjusted, but not kurtosis. With 4 terms, the median, variance, skewness and kurtosis can all be adjusted. In most cases, increasing «nTerms» enables it to fit your target distribution more closely.
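The 2-term case can be checked directly. In Keelin's formulation the 2-term unbounded quantile function is a1 + a2*ln(y/(1-y)), which is the quantile function of a Logistic distribution with location a1 and scale a2. The Python sketch below uses hypothetical function names: it picks the two coefficients to match a given mean and standard deviation, then confirms that the Logistic CDF inverts the quantile function.

```python
import math

def two_term_coeffs(mean, sd):
    """Coefficients [a1, a2] matching a mean and standard deviation.

    A Logistic distribution with scale s has standard deviation
    s * pi / sqrt(3), so s = sd * sqrt(3) / pi.
    """
    return [mean, sd * math.sqrt(3.0) / math.pi]

def two_term_quantile(coeffs, y):
    """2-term unbounded metalog quantile at cumulative probability y."""
    a1, a2 = coeffs
    return a1 + a2 * math.log(y / (1.0 - y))

def logistic_cdf(x, loc, scale):
    """CDF of the Logistic distribution."""
    return 1.0 / (1.0 + math.exp(-(x - loc) / scale))

coeffs = two_term_coeffs(mean=10.0, sd=2.0)
for y in (0.1, 0.5, 0.9):
    x = two_term_quantile(coeffs, y)
    print(y, x, logistic_cdf(x, *coeffs))   # last column recovers y
```

With only two free coefficients there is nothing left to adjust once location and scale are fixed, which is why skewness and kurtosis cannot be matched at this setting.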

You may find it useful to create a panel of fit distributions by varying «nTerms», making it possible to see what detail is revealed by the addition of terms, and also where the addition of more terms doesn't add useful detail. In many cases I've observed that there is an improved "fit" up to a point, followed by a plateau with very little change as «nTerms» increases, eventually followed by obvious over-fitting where it starts capturing the random spacing of samples. Often the plateau lasts for a long time. In these cases, it would make sense to set «nTerms» to a value on the plateau.

To create a "panel", first create an index:

Index NumTerms := 2..50

Then use your data (say x indexed by I) to explore these:

Variable fit_x := Keelin( x, , I, nTerms: NumTerms )
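The same kind of exploration can be sketched outside Analytica. The Python example below fits an unbounded metalog to a synthetic sample by ordinary least squares for several term counts and reports the residual sum of squares for each. It makes several assumptions that this page does not confirm: the sorted sample points are assigned the cumulative levels (i + 0.5)/n, the fit is plain least squares on the quantile function, and the sample itself is made up for illustration. All names are hypothetical.

```python
import math
import random

def basis(y, n_terms):
    """Metalog basis terms (Keelin 2016 ordering) at probability y."""
    logit = math.log(y / (1.0 - y))
    d = y - 0.5
    terms = [1.0, logit, d * logit, d]
    for k in range(5, n_terms + 1):
        terms.append(d ** ((k - 1) // 2) if k % 2 else d ** (k // 2 - 1) * logit)
    return terms[:n_terms]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit(sample, n_terms):
    """Least-squares coefficients and residual sum of squares."""
    xs = sorted(sample)
    n = len(xs)
    rows = [basis((i + 0.5) / n, n_terms) for i in range(n)]  # assumed levels
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n_terms)]
           for i in range(n_terms)]
    atx = [sum(r[i] * x for r, x in zip(rows, xs)) for i in range(n_terms)]
    coeffs = solve(ata, atx)                                  # normal equations
    rss = sum((x - sum(a * g for a, g in zip(coeffs, r))) ** 2
              for r, x in zip(rows, xs))
    return coeffs, rss

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(30)] + \
         [rng.gauss(4.0, 0.5) for _ in range(10)]             # made-up data
panel = {k: fit(sample, k) for k in range(2, 7)}
for k, (coeffs, rss) in panel.items():
    print(k, round(rss, 4))
```

The residual can only shrink or stay flat as terms are added, so it cannot by itself show where over-fitting begins; that judgment still comes from inspecting the fitted shapes side by side.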

In the following experiment, a data set with 100 measurements (not from any known distribution) was fit. A histogram of the data itself is shown here:

Here are four "fit" Keelin distributions as «nTerms» was varied from 2 to 20:

At 10 terms, a bi-modal effect starts to appear, which may actually be there in the data, and which is not visible below 10 terms. However, at 20 terms there appears to be more variation than is probably warranted, which we might interpret as the onset of overfitting.

@@ Line 1: / Line 1: @@
-[[category::Distribution Functions]]
+[[category:Distribution Functions]]
-[[category::Analytica 5.0]]
+[[category:Analytica 5.0]]
-== Keelin( x'', p, I, lb, ub, nTerms, over'') ==
+== Keelin( xi'', pi, I, lb, ub, nTerms, flags, over'') ==
-The ''Keelin distribution'', also known as the Keelin MetaLog distribution. This is a smooth, continuous distribution that is specified either directly from a set of representative points (a data sample), or from a list of <code>(x,p)</code> pairs, where the «x» are fractile values and the «p» are fractile levels. Another way of saying this is that «x», «p» are points on the cumulative probability curve.
+The ''Keelin distribution'', also known as the Keelin MetaLog distribution, is a smooth, continuous distribution that can be specified in one of three ways:
+* From a set of representative points (a data sample).
+* From a list of <code>(xi,pi)</code> pairs, where the «xi» are fractile values and the «pi» are fractile levels. Another way of saying this is that «xi», «pi» are points on the cumulative probability curve.
+* From a coefficient vector. These coefficients are typically obtained from data «xi», or from <code>(xi,pi)</code> pairs, using the function [[KeelinCoefficients]]. The coefficient vector may be a much shorter description of the distribution than the original data. When passing coefficients, you must specify the «flags»=1 bit, and the «I» index indexes the basis terms rather than the data.
-The Keelin distribution was introduced in the paper:
+The Keelin distribution is introduced in the paper:
 * Thomas W. Keelin (Nov. 2016), "[http://pubsonline.informs.org/doi/10.1287/deca.2016.0338 The Metalog Distribution]", Decision Analysis, 13(4):243-277
@@ Line 17: / Line 20: @@
 === Parameters ===
-* «x»: A collection of fractile estimates (corresponding to the fractile levels in «p»), or a representative sample of data when «p» is omitted. «x» is indexed by «<code>I</code>».
+* «xi»: This can be either: (1) A representative sample of data points, with «pi» omitted, (2) a collection of fractile estimates (corresponding to the fractile levels in «pi»), or (3) a Keelin coefficient vector with the «flags»=1 bit set. In all cases, «xi» must be indexed by «<code>I</code>».
-* «p»: (Optional): The fractile levels for the values in «x». Like «x», this is also indexed by «<code>I</code>». For example, when «p» is 0.05, the corresponding value if «x» is the 5th percentile.
+* «pi»: (Optional): The fractile levels for the values in «xi», also indexed by «<code>I</code>». For example, when «pi» is 0.05, the corresponding value in «xi» is the 5th percentile.
-* «<code>I</code>»: (Optional): The index of «x» and «p». This can be omitted when either «x» or «p» is itself an index.
+* «<code>I</code>»: (Optional): The index of «xi» and «pi». This can be omitted when either «xi» or «pi» is itself an index.
 * «lb», «ub»: (Optional) Upper and lower bounds. Set these if you know in advance that your quantity is bounded. When neither is specified, the distribution is unbounded (i.e., with tails going to -INF and INF). When one is set the distribution is semi-bounded, and when both are set it is fully bounded.
 * «nTerms»: (Optional) The number of basis terms used for the fit. This should be 2 or greater. See [[#Number of terms]] below.
+* «flags»: (Optional) Specify <code>flags:1</code> when «xi» contains coefficients (as obtained from the [[KeelinCoefficients]] function).
 * «over»: (Optional) A list of indexes to sample independently across.
@@ Line 66: / Line 70: @@
 == See Also ==
-* [[UncertainLMH]]
+* [[UncertainLMH]]   -- Simpler version when specifying say 10-50-90 or 25-50-75 estimates.
+* [[KeelinCoefficients]]
+* [[DensKeelin]], [[CumKeelin]], [[CumKeelinInv]] -- analytic distribution functions for Keelin
 * [[CumDist]], [[Smooth_Fractile]]
 * [[ProbDist]]