Kolmogorov-Smirnov Test – Assessing Reliability in Normal Distribution and Clustering

clustering, kolmogorov-smirnov test, normal distribution

Description

I want to use the Kolmogorov-Smirnov test to check whether given clusters of 1D points differ from a normal distribution (original question here: How to test which data match model at best).

I am considering the following approach:

FOREACH cluster
  p = points FROM cluster
  n = SIZE(p)
  mu = AVG(p)
  sigma = SQRT(VARIANCE(p))
  tmp = GENERATE n RANDOM points FROM normal_distribution(mu, sigma)
  result = KS-TEST(SORT(p), SORT(tmp))
  IF result > threshold THEN ok OTHERWISE not ok

I took the implementation of KS-TEST from here: http://root.cern.ch/root/html/src/TMath.cxx.html#RDBIQ
The number of points per cluster is usually in the hundreds or thousands.
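For concreteness, here is a minimal Python sketch of that procedure using scipy.stats.ks_2samp in place of the linked ROOT implementation; the function name, the 0.05 threshold, and the example cluster are illustrative assumptions, not part of the original question.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def cluster_close_to_normal(points, threshold=0.05):
    """Two-sample KS test of a cluster against a freshly simulated normal
    sample that uses the cluster's own mean and standard deviation."""
    points = np.asarray(points, dtype=float)
    mu = points.mean()
    sigma = points.std(ddof=1)            # sample standard deviation
    tmp = rng.normal(mu, sigma, size=points.size)
    statistic, p_value = stats.ks_2samp(points, tmp)   # sorting is handled internally
    return p_value > threshold, p_value

# Example: one cluster of a few hundred points.
cluster = rng.normal(10.0, 2.0, size=500)
print(cluster_close_to_normal(cluster))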

Problem

I have observed that the result strongly depends on the randomly generated "tmp" points. Even when I randomly generated two sets of points from the same distribution with the same parameters, the resulting probability from KS-TEST varied from just above 0.0 to above 0.99. So it is difficult for me to choose a proper "threshold" value: the same cluster can be considered "close to a normal distribution" on one run and not on another.

Can you give me advice on what I am doing wrong and how I can get more reliable results?

Best Answer

There are two standard versions of the Kolmogorov-Smirnov test:

  • The one-sample KS, which tests if a sample of points $X_1, \ldots, X_n \in \mathbb{R}$ fits a specific continuous distribution function $F$.
  • The two-sample KS, which tests whether it is reasonable to assume that two sets of samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ come from the same continuous distribution.

It seems that the code you are using only provides the two-sample version, but your problem is inherently a one-sample goodness-of-fit problem. It would be better to find an implementation of the one-sample test. This would eliminate the needless step of generating the variable 'tmp' and should increase the statistical power of the procedure.
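As an illustration (not the ROOT code referenced above), the one-sample test against the fitted normal CDF can be run with scipy.stats.kstest; the helper name below is made up, and the comments note the usual caveat about estimating the parameters from the same data.

import numpy as np
from scipy import stats

def one_sample_ks_normal(points):
    """One-sample KS test of the points against a normal distribution whose
    mean and standard deviation are estimated from the points themselves."""
    points = np.asarray(points, dtype=float)
    mu = points.mean()
    sigma = points.std(ddof=1)
    # The empirical CDF is compared directly with the fitted normal CDF,
    # so no random 'tmp' sample is needed.
    statistic, p_value = stats.kstest(points, "norm", args=(mu, sigma))
    # Caveat: because mu and sigma come from the same data, the nominal
    # p-value is only approximate (cf. the Lilliefors correction).
    return statistic, p_value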

The Kolmogorov-Smirnov test is often a poor choice, since it has little sensitivity in the tails of the distribution. I would recommend trying other tests, such as the Anderson-Darling test or the Berk-Jones test.
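As a sketch, SciPy ships an Anderson-Darling test for normality (scipy.stats.anderson), which reports a statistic and critical values rather than a p-value; the wrapper below and the 5% level are illustrative assumptions.

from scipy import stats

def looks_normal_by_anderson(points, level=5.0):
    """Anderson-Darling test for normality at the given significance level
    (in percent, one of 15, 10, 5, 2.5, 1); the normal parameters are
    estimated from the data."""
    result = stats.anderson(points, dist="norm")
    idx = list(result.significance_level).index(level)
    return result.statistic < result.critical_values[idx]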

As for the distribution of test results: this is expected. Under the null hypothesis (that the samples come from exactly the distribution you are testing against) the p-value computed for the Kolmogorov-Smirnov statistic is a Uniform[0,1] random variable.

In fact, this is true of $p$-values under the null hypothesis whenever the test statistic has a continuous distribution under that null. For more information about this fact, see: "Why are p-values uniformly distributed under the null hypothesis?"
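A quick simulation makes this concrete; it is only a sketch, using scipy's one-sample KS test against the true parameters (not parameters estimated from the data), with the sample size and repetition count chosen arbitrarily.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Draw many samples from N(0, 1) and test each one against N(0, 1) itself.
p_values = np.array([
    stats.kstest(rng.normal(0.0, 1.0, size=500), "norm").pvalue
    for _ in range(2000)
])

# Under the null hypothesis the p-values are approximately Uniform[0, 1]:
# about 5% of them fall below 0.05, and the histogram over ten equal bins
# is roughly flat.
print((p_values < 0.05).mean())
print(np.histogram(p_values, bins=10, range=(0, 1))[0])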