Description
I want to use the Kolmogorov-Smirnov test to check how much given clusters of 1D points differ from a normal distribution (original question here: How to test which data match model at best).
I am considering the following approach:
FOREACH cluster
    p = points FROM cluster
    n = SIZE(p)
    mu = AVG(p)
    sigma = SQRT(VARIANCE(p))
    tmp = GENERATE n RANDOM points FROM normal_distribution(mu, sigma)
    result = KS-TEST(SORT(p), SORT(tmp))
    IF result > threshold THEN ok OTHERWISE not ok
I took the implementation of KS-TEST from here: http://root.cern.ch/root/html/src/TMath.cxx.html#RDBIQ
The number of points per cluster is usually in the hundreds or thousands.
Problem
I have observed that the result depends strongly on the randomly generated "tmp" points. Even when I generated two sets of points from the same distribution with the same parameters, the probability returned by KS-TEST ranged from just above 0.0 to just above 0.99. This makes it difficult to choose a proper "threshold" value: the same cluster can be classified as "close to a normal distribution" in one run and not in the next.
Question
Can you advise me on what I am doing wrong and how I can get more reliable results?
Best Answer
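There are two standard versions of the Kolmogorov-Smirnov test: the one-sample test, which compares a sample against a fully specified reference distribution, and the two-sample test, which checks whether two samples come from the same distribution.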
It seems that the code you are using only provides the two-sample version, but your problem is inherently a one-sample goodness-of-fit problem. It would be better to find an implementation of the one-sample test. This would eliminate the needless step of generating the variable 'tmp' and should increase the statistical power of the procedure.
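As a minimal sketch of the one-sample approach, the Python snippet below computes the KS statistic of a cluster directly against a normal CDF and converts it to a p-value with the asymptotic Kolmogorov series. The function names and the simulated cluster are illustrative assumptions, not part of the ROOT code, and only the standard library is used. Note that when mu and sigma are estimated from the same points, this p-value is conservative (too large); the Lilliefors variant of the test corrects for that.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """One-sample KS statistic: largest gap between the empirical CDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def kolmogorov_pvalue(d, n):
    """Asymptotic p-value 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical cluster; in practice this would be the points of one cluster.
rng = random.Random(0)
cluster = [rng.gauss(5.0, 2.0) for _ in range(500)]

# Parameters are estimated from the data, so the p-value is only approximate.
fitted = NormalDist(fmean(cluster), stdev(cluster))
d = ks_statistic(cluster, fitted.cdf)
p = kolmogorov_pvalue(d, len(cluster))
print(f"D = {d:.4f}, asymptotic p-value = {p:.4f}")
```

No random reference sample is drawn, so repeated runs on the same cluster give the same statistic and the same p-value.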
Kolmogorov-Smirnov is often a poor choice because it has very little sensitivity in the tails of the distribution. I would recommend trying other tests, such as the Anderson-Darling test or the Berk-Jones test.
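For comparison, here is a rough sketch of the Anderson-Darling statistic for normality, which weights deviations in the tails more heavily than KS does. The function name is a hypothetical helper. The small-sample correction factor and the 5% critical value of about 0.752 (for the case where both mean and variance are estimated) are the commonly tabulated values attributed to Stephens, and should be checked against a reference before being relied on.

```python
import math
from statistics import NormalDist, fmean, stdev


def anderson_darling_normal(sample):
    """Modified Anderson-Darling statistic for normality, parameters estimated."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    # Clamp CDF values away from 0 and 1 so the logarithms stay finite.
    cdf = [min(max(fitted.cdf(x), 1e-15), 1.0 - 1e-15) for x in xs]
    s = 0.0
    for i in range(1, n + 1):
        s += (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
    a2 = -n - s / n
    # Small-sample correction used when mean and variance are both estimated.
    return a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)


# Assumed 5% critical value for the modified statistic (see the note above).
CRITICAL_5_PERCENT = 0.752

# Illustrative data: evenly spaced exponential quantiles, clearly non-normal.
n_points = 200
skewed_sample = [-math.log(1.0 - (i - 0.5) / n_points) for i in range(1, n_points + 1)]
a2_star = anderson_darling_normal(skewed_sample)
print("reject normality at 5%:", a2_star > CRITICAL_5_PERCENT)
```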
As for the distribution of test results: this is expected. Under the null hypothesis (that the samples come from exactly the distribution you are testing against) the p-value computed for the Kolmogorov-Smirnov statistic is a Uniform[0,1] random variable.
In fact, this is always true for p-values under the null hypothesis when the statistic and the null distribution are continuous. For more information about this fact, see: "Why are p-values uniformly distributed under the null hypothesis?"
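The uniformity of p-values under the null hypothesis can be checked with a small simulation. The sketch below is illustrative only: the helper names, the sample size, and the number of trials are arbitrary assumptions, and the p-value uses the asymptotic Kolmogorov series. It repeatedly draws samples from N(0, 1), runs a one-sample KS test against that same N(0, 1), and tallies the p-values into five equal-width bins.

```python
import math
import random
from statistics import NormalDist


def ks_pvalue(sample, cdf):
    """One-sample KS test against cdf, returning the asymptotic p-value."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, 2.0 * total))


def simulate_null_pvalues(trials=1000, n=200, seed=1):
    """Draw `trials` samples from N(0, 1) and test each against N(0, 1)."""
    rng = random.Random(seed)
    null = NormalDist(0.0, 1.0)
    pvalues = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        pvalues.append(ks_pvalue(sample, null.cdf))
    return pvalues


pvalues = simulate_null_pvalues()

# Tally p-values into five bins of width 0.2; each should hold roughly 20%.
counts = [0] * 5
for p in pvalues:
    counts[min(int(p * 5), 4)] += 1
for b, c in enumerate(counts):
    print(f"[{b * 0.2:.1f}, {(b + 1) * 0.2:.1f}): {c / len(pvalues):.3f}")
```

Every sample here really does come from the null distribution, yet the p-values spread over the whole interval. This is why a single p-value close to 0 or close to 0.99 from one random draw is not, by itself, evidence of anything.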