How to Fit a Distribution to Data

This article discusses techniques for fitting a given distribution type to historical data. The problem of determining which distribution type best reflects a data set is a bit different and is not covered here.

Generalized regression techniques such as Logistic Regression are used to predict the probability of an outcome from many input variables. This article covers a simpler topic -- estimating a marginal distribution without any conditionality on any inputs. For example, given a set of data between 0 and 1, how would you find the parameters of the best fit Beta distribution?

Once a distribution type has been identified, the parameters to be estimated have been fixed, so that a best-fit distribution is usually defined as the one with the maximum likelihood parameters given the data.

Specific Estimation Formulae

Many textbooks provide parameter estimation formulas or methods for most of the standard distribution types. Use of these are, by far, the easiest and most efficient way to proceed. For example, the parameters of a best-fit Normal distribution are just the sample Mean and sample standard deviation.

The book Uncertainty by Morgan and Henrion, Cambridge University Press, provides parameter estimation formula for many common distributions (Normal, LogNormal, Exponential, Poisson, Gamma, Weibull, Uniform, Triangular, and Beta). Estimation formula for other distribution types can often be found on Wikipedia.

Some estimation formula are summarized here. Data is denoted by x, and the index of the data by I.

Normal(m,s)

m = Mean(x,I)
s = SDeviation(x,I)

LogNormal(med,gsdev)

med = Exp(Mean(Ln(x)))
gsdev = Exp(SDeviation(Ln(x)))

Exponential(m)

m = Mean(x,I)

Poisson(m)

m = Mean(x,I)

Gamma(a,b)

a = (Mean(x,I) / SDeviation(x,I))^2
b = Mean(x,I) / SDeviation(x,I)^2

Weibull(k,c)

The parameters for Weibull are fit using a regression. By re-arranging the CDF of the Weibull and substituting Z = Ln(-Ln(1-F(x))) and Y = Ln(x), the relationship between Z and Y is linear, so we can use Regression to fit Z = mY + b. Thus, the procedure is this:

Index bm := ['b','m'];
var Fx :=  (Rank(Data,Data_Index) - 0.5) / Size(Data_Index);
var Z := Ln(-Ln(1-Fx));
var fit := Regression(Z,Array(bm,[1,ln(x)]),I,bm);
var m := fit[bm='m'];
var b := fit[bm='b'];
...

Then the Weibull parameters are given by

k = m
c = Exp(-b/m)

Uniform(a,b)

With the Uniform, a maximum likelihood criteria suggests using:

a = Min(x,I) - Epsilon
b = Max(x,I) + Epsilon

where Epsilon is a very small positive number. However, in this case, this criteria does not correspond to what intuition would consider a best fit. Some people use:

a = Mean(x,I) - 3 * SDeviation(x,I)
b = Mean(x,I) + 3 * SDeviation(x,I)

This does not guarantee that the fitted distribution contains all the points (i.e., the likelihood of the data could be zero). A different approach is to consider that it is highly improbable that the smallest point observed was exactly at the lower bound. Most likely the distance from the lower bound to the smallest point is about half the average distance between adjacent points. This leads to a re-estimation formula of:

a = Min(x,I) - (Max(x,I)-Min(x,I)) / (2*Size(I))
b = Max(x,I) + (Max(x,I)-Min(x,I)) / (2*Size(I))

Triangular(a,b,c)

Beta(a,b)

a = (Mean(x,I)^2 - Mean(x,I)^3 - SDeviation(x,I)^2 * Mean(x,I)) / SDeviation(x,I)^2
b = (Mean(x,I) * (1-Mean(x,I)^2 - SDeviation(x,I)^2 * (1-Mean(x,I))) / SDeviation(x,I)^2
Comments


You are not allowed to post comments.