Kolmogorov-Smirnov Tests
Are the data points in a sample drawn from a specific known distribution? For example, are they drawn from a Gamma(3,2)
distribution? Are the data points in one sample drawn from the same distribution as the data points in a second sample?
The Kolmogorov-Smirnov test(s) can be used to answer either of these questions.
- The data must be univariate and continuous (i.e., each sample point is a single real number).
- It works for any continuous distribution -- there is no assumption that the distribution is Normal or anything else.
The tests return a p-value: the probability of seeing a distance at least this large between the two distributions (in the one-sample case, between the sample and the reference) if there is in fact no difference in distribution. A small p-value, say p-value < 0.05, means a difference was detected, with a false-positive rate of 0.05. If the p-value > 0.05, you cannot conclude that the distributions are the same -- only that the test and the amount of data did not provide enough power to detect a difference, if one exists.
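As background, the one-sample KS statistic is simply the maximum vertical distance between the empirical CDF of the sample and the reference CDF. The following is a minimal Python sketch of that statistic (an illustration only, not part of the Analytica library), checked against SciPy's implementation:

```python
import numpy as np
from scipy import stats

def ks_statistic_one_sample(x, ref_cdf):
    """Maximum vertical distance between the empirical CDF of x
    and the hypothesized reference CDF."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cdf_vals = ref_cdf(x)
    # The empirical CDF is a step function, so check the distance just
    # above (d_plus) and just below (d_minus) each step.
    d_plus = np.max(np.arange(1, n + 1) / n - cdf_vals)
    d_minus = np.max(cdf_vals - np.arange(0, n) / n)
    return max(d_plus, d_minus)

rng = np.random.default_rng(0)
sample = stats.gamma.rvs(a=3, scale=2, size=100, random_state=rng)
d = ks_statistic_one_sample(sample, lambda v: stats.gamma.cdf(v, a=3, scale=2))
```

The p-value is then computed from the distribution of this statistic under the null hypothesis.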
Kolmogorov-Smirnov Tests Library
Download: media:Kolmogorov-Smirnov Tests Library.ana
To use this library, download it and then select Add Library... from the File menu.
This library provides two functions:
KS_test_one_dist( x, I, refCdf )
- Tests whether the empirical distribution of data «x», indexed by «I», matches a hypothesized distribution whose cumulative probability function is passed as the third parameter, «refCdf».
KS_test_two_dist( xi, xj, I, J )
- Tests whether the data in «xi» and the data in «xj» appear to be drawn from the same distribution. «xi» is indexed by «I» and «xj» is indexed by «J».
Both functions return the p-value.
- If p-value < 0.05
- Conclude (with statistical significance 0.05) that the distributions differ.
- If p-value > 0.05
- You can't conclude that the distributions are the same or different. Only that the test did not have enough statistical power, or enough data, to detect a difference in distribution if one exists.
Using KS_test_one_dist
To do a one-sample test, you need to provide a cdf function for your hypothesized distribution. You can do this either by creating a new global UDF, or by using a Local function.
To test whether the data in X is distributed as Gamma( 5.9, 1.65 ), you can first create a UDF:
Function MyHypothCDF(x) ::= CumGamma( x, 5.9, 1.65)
And then pass this to the function using:
KS_test_one_dist( x, I, MyHypothCDF )
Alternately, you can use a local function like this:
Function hypCdf(x) : CumGamma( x, 5.9, 1.65 );
KS_test_one_dist( x, I, hypCdf )
Note that a sole colon separates the function declaration from its defining expression in the local declaration (i.e., not an assignment operator, :=).
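For readers who want to sanity-check a result outside of Analytica, the same one-sample test can be run in Python with SciPy (an illustration only; the parameters match the Gamma(5.9, 1.65) example above, and the sample here is simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated data actually drawn from Gamma(5.9, 1.65), so the test
# should usually NOT reject (p-value typically well above 0.05).
x = stats.gamma.rvs(a=5.9, scale=1.65, size=200, random_state=rng)

# Counterpart of KS_test_one_dist( x, I, MyHypothCDF ):
# pass the sample and the hypothesized CDF.
result = stats.kstest(x, lambda v: stats.gamma.cdf(v, a=5.9, scale=1.65))
p_value = result.pvalue
```

As with the Analytica function, the hypothesized CDF is passed as a function, so any continuous distribution can be tested the same way.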
Using KS_test_two_dist
When comparing two data sets, each may have its own index, and the two samples can have different lengths. Usage is simple -- pass both samples, then the index of each:
KS_test_two_dist( xi, xj, I, J )
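The two-sample test can likewise be checked in Python with SciPy (an illustration only; the data here is simulated, with the two samples deliberately drawn from different distributions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
xi = rng.normal(loc=0.0, scale=1.0, size=150)  # plays the role of «xi» indexed by «I»
xj = rng.normal(loc=2.0, scale=1.0, size=120)  # plays the role of «xj» indexed by «J»

# Counterpart of KS_test_two_dist( xi, xj, I, J ).
# Note the two samples have different lengths, which is fine.
stat, p_value = stats.ks_2samp(xi, xj)
# With this much data and a 2-sigma shift in means, the test detects
# the difference: p_value < 0.05.
```
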
Limitations
- The p-value from KS_test_one_dist is overly conservative when you use the data itself to estimate the parameters of your hypothesized distribution. If you do this and still get a p-value < 0.05, you can safely conclude that the distributions differ; but in general the p-value will be larger than it should be, so other statistical tests will be more powerful in this situation.
- Stronger statistical tests exist for specific distributions. KS is weaker because it makes no distributional assumptions.
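The first limitation can be seen directly in a small Python experiment (an illustration only, using SciPy and a Normal distribution): testing the data against a distribution whose parameters were fitted to that same data yields p-values biased upward relative to testing against the true distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=5.0, scale=2.0, size=300)

# Test against the TRUE distribution: under the null hypothesis,
# these p-values are uniformly distributed on [0, 1].
p_true = stats.kstest(x, lambda v: stats.norm.cdf(v, loc=5.0, scale=2.0)).pvalue

# Test against a distribution whose parameters were FITTED to the same
# data: the fit pulls the hypothesized CDF toward the empirical CDF, so
# the p-value tends to be larger than it should be (too conservative).
mu, sigma = np.mean(x), np.std(x, ddof=1)
p_fitted = stats.kstest(x, lambda v: stats.norm.cdf(v, loc=mu, scale=sigma)).pvalue
```

Repeating this over many simulated samples would show the fitted-parameter p-values clustering near 1 rather than spreading uniformly, which is why a significant result is still trustworthy but a non-significant one is even less informative than usual.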
There are some things that should be cleaned up in this model.
- The identifiers that appear in the Examples module don't have a namespace prefix, so they are at risk of name collisions. (If name collisions occur, you'll see a dialog when you Add the library to your model.)
- You'll get a warning if you open this in a release prior to Analytica 6.3. You can ignore the warning as long as your release is fairly recent (like 6.x).
See Also
- Hypothesis Testing (A webinar)
- Anderson-Darling test