Kolmogorov-Smirnov Tests

Revision as of 21:20, 27 March 2023 by Drice (talk | contribs) (Created)

Are the data points in a sample drawn from a specific known distribution? For example, are they drawn from a Gamma(3,2) distribution? Are the data points in one sample drawn from the same distribution as the data points in a second sample?

The Kolmogorov-Smirnov test(s) can be used to answer either of these questions.

  • The data must be univariate and continuous (i.e., each sample point is a single real number).
  • It works for any continuous distribution -- there is no assumption that the distribution is Normal or anything else.

The tests return a p-value: the probability of observing at least as large a distance between the two distributions (in the one-sample case, between the sample and the reference distribution) when there is truly no difference in distribution. A small p-value, say p-value < 0.05, means a difference was detected at a false-positive rate of 0.05. If the p-value > 0.05, you can't conclude that they are the same distribution -- only that the test and the amount of data did not provide enough power to detect a difference if one exists.
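The "distance" the KS statistic measures is the largest vertical gap between two cumulative distribution functions. As an illustration (a Python sketch, not part of the Analytica library), the one-sample statistic can be computed by hand against the Uniform(0, 1) reference distribution, whose CDF is F(t) = t:

```python
def ks_statistic_uniform(sample):
    """One-sample KS statistic of `sample` vs. the Uniform(0, 1) CDF.

    Illustrative sketch only: the reference CDF F(t) = t is hard-coded,
    and the sample is assumed to lie in [0, 1].
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x, so the largest
        # gap at x is on one side of the jump; check both sides.
        d = max(d, abs((i + 1) / n - x), abs(x - i / n))
    return d

print(ks_statistic_uniform([0.1, 0.2, 0.3, 0.4, 0.5]))  # → 0.5
```

The p-value then comes from the sampling distribution of this statistic under the null hypothesis.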

Kolmogorov-Smirnov Tests Library

Download: media:Kolmogorov-Smirnov Tests Library.ana

This library provides two functions:

KS_test_one_dist( x, I, refCdf )
Tests whether the empirical distribution of data «x», which is indexed by «I», matches a hypothesized distribution whose cumulative distribution function is passed as «refCdf».
KS_test_two_dist( xi, xj, I, J )
Tests whether the data in «xi» and data in «xj» appear to be from the same distribution. «xi» is indexed by «I» and «xj» is indexed by «J».

Both functions return the p-value.

If p-value < 0.05
Conclude (with statistical significance 0.05) that the distributions differ.
If p-value > 0.05
You can't conclude that they have the same or different distributions -- only that the test, with this amount of data, did not have enough statistical power to detect a difference in distribution if one exists.

Using KS_test_one_dist

To do a one-sample test, you need to provide a cdf function for your hypothesized distribution. You can do this either by creating a new global UDF, or by using a Local function.

To test whether the data in X is distributed as Gamma( 5.9, 1.65 ), you can first create a UDF:

Function MyHypothCDF(x) ::= CumGamma( x, 5.9, 1.65)

And then pass this to the function using:

KS_test_one_dist( x, I, MyHypothCDF )

Alternatively, you can use a local function like this:

Function hypCdf(x) : CumGamma( x, 5.9, 1.65 );
KS_test_one_dist( x, I, hypCdf )

Note that a single colon (not the assignment operator, :=) separates the function declaration from its defining expression in a local declaration.
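For comparison, the same one-sample test can be run outside Analytica. A minimal sketch in Python using SciPy, assuming CumGamma(x, a, b) takes shape «a» and scale «b» (matching SciPy's gamma.cdf(x, a, scale=b)):

```python
# Illustrative Python/SciPy sketch, not the Analytica library itself:
# one-sample KS test of data against a hypothesized Gamma(5.9, 1.65)
# distribution, mirroring the MyHypothCDF example above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = stats.gamma.rvs(5.9, scale=1.65, size=200, random_state=rng)

# Pass the hypothesized CDF as a callable, analogous to passing MyHypothCDF.
result = stats.kstest(x, lambda t: stats.gamma.cdf(t, 5.9, scale=1.65))
print(result.statistic, result.pvalue)
```

Since these data really were drawn from the hypothesized distribution, the p-value will usually (though not always) exceed 0.05.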

Using KS_test_two_dist

When comparing two data sets, each may have its own index, and each can be a different length. The usage is easy -- just pass each data set, then the index of each:

KS_test_two_dist( xi, xj, I, J )
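A comparable two-sample test in Python using SciPy (an outside-Analytica illustration, analogous to KS_test_two_dist; the sample sizes differ, just as «I» and «J» may have different lengths):

```python
# Illustrative Python/SciPy sketch, not the Analytica library itself:
# two-sample KS test of whether xi and xj come from the same distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
xi = rng.normal(0.0, 1.0, size=150)   # sample indexed by I
xj = rng.normal(0.5, 1.0, size=90)    # sample indexed by J: shifted mean, different length

result = stats.ks_2samp(xi, xj)
print(result.statistic, result.pvalue)
```

Because the second sample's mean is shifted, a small p-value is likely here; with unshifted samples the p-value would typically be large.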