Anderson-Darling test
The Anderson-Darling test is used to detect whether a set of univariate continuous samples comes from a known distribution. The [[Kolmogorov-Smirnov Tests]] are used for the same thing, but differ in the choice of metric used to measure the distance between distributions. The Anderson-Darling test is more powerful (i.e., more likely to detect a difference in distribution when one exists) in many cases.

However, the distribution of the Anderson-Darling test statistic has no closed form. Even though the test is distribution-free in theory, in practice critical-value tables have been compiled for many different forms of the null-hypothesis distribution. These tables are usually estimates, obtained by simulating many examples and then fitting a parametric model to approximate the true values.

Here, instead of assuming any particular distribution, I show how to compute the p-value for any null-hypothesis distribution in Analytica. The trade-off is that this approach is truly distribution-free, but it requires a Monte Carlo simulation to compute each p-value.
== The test ==
Given a sample of data, <code>X</code>, indexed by <code>I</code>, where each cell is a scalar, we ask the question: Was this data sampled from the distribution <code>F</code>? To answer this, we compute the p-value, which is the probability of observing an Anderson-Darling distance greater than the measured distance under the assumption that the data was indeed generated by <code>F</code>. This assumption is called the null hypothesis. If the computed p-value is less than 0.05, we conclude that there is statistically significant evidence that the data's distribution is not the same as <code>F</code>.
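In symbols (this notation is introduced here for clarity, not part of the original recipe), writing <math>A^2_{\text{obs}}</math> for the distance measured on the data:

<math>p = \Pr\left( A^2 \ge A^2_{\text{obs}} \mid H_0 \right)</math>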
To perform the test, in addition to the data, you need to be able to compute two things:
* The CDF of the hypothesized distribution at any value <code>x</code>, written <code>F(x)</code>.
* A random sample from the hypothesized distribution.
For example, if you want to test whether the data comes from a <code>Normal(10, 2)</code> distribution, you'll need these:

 Function F(x) := CumNormal( x, 10, 2 )   { To compute the CDF }
 Normal( 10, 2 )                          { To generate the sample }
== The Anderson-Darling distance ==
The Anderson-Darling distance is a metric that measures the distance (in distribution) between a sample of points and a known distribution. It is implemented in Analytica as follows:
 Function AD_dist( x : [I] ; I : Index ; F : Function(x) atom )
 Definition:
   Local xs := Sort(x, I);               { data sorted ascending }
   Local n := Sum(x <> Null, I);         { number of non-null data points }
   Local S := Sum( (2*@I - 1)/n * (Ln(F(xs)) + Ln(1 - F(Reverse(xs, I)))), I );
   -n - S                                { the Anderson-Darling statistic }
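For reference (the formula is standard, though not stated in the original), this function computes the usual Anderson-Darling statistic on the sorted data <math>x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}</math>:

<math>A^2 = -n - \sum_{i=1}^{n} \frac{2i-1}{n} \left[ \ln F\left(x_{(i)}\right) + \ln\left(1 - F\left(x_{(n+1-i)}\right)\right) \right]</math>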
The distance from your data to the hypothesized distribution is given by:

 Variable A2 := AD_dist( X, I, F )
== Computing the p-value ==
To compute the p-value by Monte Carlo simulation, create a chance variable that samples from the hypothesized distribution. For example, if the null hypothesis is that the data comes from a <code>Normal(10, 2)</code> distribution, then your chance variable will be defined as:

 Chance Xr := Normal( 10, 2 )
Use two more variables as follows:

 Variable Sim_AD_dist_given_H0 := AD_dist( Xr, I, F )
 Variable p_value := Probability( Sim_AD_dist_given_H0 > A2 )
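Because <code>Xr</code> is a chance variable, <code>Sim_AD_dist_given_H0</code> is uncertain: each Monte Carlo run yields one simulated distance under the null hypothesis. <code>Probability</code> then returns the fraction of simulated distances that exceed the observed <code>A2</code>, which is the p-value. If you find it convenient, you can add a final indicator (a hypothetical variable name, not part of the original recipe):

 Variable Reject_H0 := p_value < 0.05   { True when the evidence is statistically significant at the 5% level }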
== Notes ==
This approach generalizes to any hypothesis test.

The computed p-value has some sampling error because it is approximated using Monte Carlo sampling. You can reduce this error by increasing the [[Uncertainty Setup dialog|Sample Size]]. A larger sample size takes longer and requires more memory, but is more accurate. However, the default sample size of 1000 is sufficient in almost all cases, especially if you aren't right on the 0.05 threshold.
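To quantify that sampling error (this is the standard binomial-proportion approximation, added here for reference): if the p-value is estimated from <math>m</math> Monte Carlo samples, its standard error is roughly

<math>\mathrm{SE}(\hat{p}) \approx \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{m}}</math>

For example, with <math>m = 1000</math> and <math>\hat{p} = 0.05</math>, the standard error is about 0.007.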
== See Also ==
* [[Kolmogorov-Smirnov Tests]]
* [[Tutorial_videos#Session_8:_Hypothesis_Testing|Webinar on Hypothesis Testing]]