Kolmogorov-Smirnov Tests


Are the data points in a sample drawn from a specific known distribution? For example, are they drawn from a Gamma(3,2) distribution? Are the data points in one sample drawn from the same distribution as the data points in a second sample?

The Kolmogorov-Smirnov test(s) can be used to answer either of these questions.

  • The data must be univariate and continuous (i.e., each sample point is a single real number).
  • It works for any continuous distribution -- there is no assumption that the distribution is Normal or anything else.

The tests return a p-value: the probability of seeing at least as large a distance between the two distributions (in the one-sample case, between the sample and the reference) if the distributions are actually the same. A small p-value, say p-value < 0.05, means a difference was detected at a 5% false-positive rate. If the p-value > 0.05, you can't conclude that they are the same distribution -- only that the test and amount of data did not provide enough power to detect a difference, if one exists.
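As a rough illustration of what the test computes (shown here in Python with SciPy, not the Analytica library), the KS statistic is the largest vertical distance D between the sample's empirical CDF and the reference CDF, which is then converted to a p-value:

```python
# Illustration in Python/SciPy (not Analytica) of the quantity the KS test
# measures: the maximum distance D between the empirical CDF of a sample and
# a hypothesized CDF, and the p-value derived from D.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=500)   # a sample from Gamma(3, 2)

# Test against the true generating distribution: D stays small.
d, p = stats.kstest(x, stats.gamma(a=3.0, scale=2.0).cdf)

# Test against a deliberately wrong hypothesis: D is much larger and the
# p-value far smaller, so the difference is detected.
d_bad, p_bad = stats.kstest(x, stats.gamma(a=5.0, scale=2.0).cdf)
```

With 500 points, the wrong hypothesis produces a much larger distance and a much smaller p-value than the correct one.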

Kolmogorov-Smirnov Tests Library

Download: media:Kolmogorov-Smirnov Tests Library.ana

To use this library, choose Add Library... from the File menu after downloading it.

This library provides two functions:

KS_test_one_dist( x, I, refCdf )
Tests whether the empirical distribution of data «x», which is indexed by «I», matches a hypothesized distribution whose cumulative distribution function is passed as «refCdf».
KS_test_two_dist( xi, xj, I, J )
Tests whether the data in «xi» and data in «xj» appear to be from the same distribution. «xi» is indexed by «I» and «xj» is indexed by «J».

Both functions return the p-value.

If p-value < 0.05
Conclude (at the 0.05 significance level) that the distributions differ.
If p-value > 0.05
You can't conclude that the distributions are the same or that they differ -- only that the test did not have enough statistical power, or enough data, to detect a difference in distribution if one exists.

Using KS_test_one_dist

To do a one-sample test, you need to provide a cdf function for your hypothesized distribution. You can do this either by creating a new global UDF, or by using a Local function.

To test whether the data in X is distributed as Gamma( 5.9, 1.65 ), you can first create a UDF:

Function MyHypothCDF(x) ::= CumGamma( x, 5.9, 1.65)

And then pass this to the function using:

KS_test_one_dist( x, I, MyHypothCDF )

Alternately, you can use a local function like this:

Function hypCdf(x) : CumGamma( x, 5.9, 1.65 );
KS_test_one_dist( x, I, hypCdf )

Note that a single colon separates the function declaration from its defining expression in the local declaration (i.e., not an assignment operator, :=).
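The same one-sample test can be sketched in Python with SciPy (again, an illustration of the statistics rather than the Analytica library; it assumes the second parameter of CumGamma is the scale parameter):

```python
# Python/SciPy sketch of KS_test_one_dist(x, I, hypCdf) where
# hypCdf(x) = CumGamma(x, 5.9, 1.65); assumes 1.65 is the scale parameter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=5.9, scale=1.65, size=300)  # data consistent with the hypothesis

# stats.kstest takes the data and a callable CDF, just as KS_test_one_dist
# takes «x» and a CDF function.
statistic, p_value = stats.kstest(x, stats.gamma(a=5.9, scale=1.65).cdf)
```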

Using KS_test_two_dist

When comparing two data sets, each may have its own index, and the two may have different lengths. The usage is easy -- just pass each data set, then the index of each:

KS_test_two_dist( xi, xj, I, J )
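A SciPy sketch of the two-sample case (illustrative only, not the Analytica library) shows that the two samples may indeed have different lengths, mirroring indexes «I» and «J» of different sizes:

```python
# Python/SciPy sketch of KS_test_two_dist(xi, xj, I, J): the two samples may
# have different lengths, like arrays over different indexes I and J.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
xi = rng.normal(0.0, 1.0, size=200)   # plays the role of xi over index I
xj = rng.normal(0.0, 1.0, size=350)   # plays the role of xj over index J

d_same, p_same = stats.ks_2samp(xi, xj)   # drawn from the same distribution

# A sample shifted by one standard deviation is easily detected as different.
xk = rng.normal(1.0, 1.0, size=350)
d_diff, p_diff = stats.ks_2samp(xi, xk)
```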

Limitations

  • The p-value from KS_test_one_dist is conservative (too large) when you use the data itself to estimate the parameters of your hypothesized distribution. If you do this and still get a p-value < 0.05, you can safely conclude that the distributions differ; but in general the p-value will be larger than it should be, and in that case other statistical tests will be more powerful.
  • Stronger statistical tests exist for specific distributions. KS is weaker because it makes no distributional assumption.
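The first limitation can be seen in a small simulation (Python/SciPy, a hypothetical illustration unrelated to the library's code): when the hypothesized distribution's parameters are estimated from the same data, the p-values pile up near 1 instead of being uniform on [0, 1], so the test is conservative:

```python
# Simulation: KS p-values when the hypothesized distribution's parameters are
# estimated from the same data (the "Lilliefors" situation). Under a correct,
# fully pre-specified null, p-values would be uniform on [0, 1] (mean 0.5);
# with estimated parameters they concentrate near 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p_values = []
for _ in range(200):
    x = rng.normal(0.0, 1.0, size=100)
    mu, sigma = x.mean(), x.std(ddof=1)        # parameters estimated from x
    _, p = stats.kstest(x, stats.norm(mu, sigma).cdf)
    p_values.append(p)

print(np.mean(p_values))   # well above the 0.5 expected for uniform p-values
```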

A few things in this library could still be cleaned up:

  • The identifiers that appear in the Examples module don't have a namespace prefix. Hence, they are at risk for name collisions. (If name collisions occur, you'll see a dialog when you Add the library to your model).
  • You'll get a warning if you open this in a release prior to Analytica 6.3. You can ignore the warning as long as your release is fairly recent (like 6.x).
