Kolmogorov-Smirnov Test – Assessing Reliability in Normal Distribution and Clustering

clustering, kolmogorov-smirnov test, normal distribution

Description

I want to use the Kolmogorov-Smirnov test to check whether given clusters of 1D points differ from a normal distribution (original question here: How to test which data match model at best).

I am considering the following approach:

FOREACH cluster
  p = points FROM cluster
  n = SIZE(p)
  mu = AVG(p)
  sigma = SQRT(VARIANCE(p))
  tmp = GENERATE n RANDOM points FROM normal_distribution(mu, sigma)
  result = KS-TEST(SORT(p), SORT(tmp))
  IF result > threshold THEN ok OTHERWISE not ok

I took the implementation of KS-TEST from here: http://root.cern.ch/root/html/src/TMath.cxx.html#RDBIQ
The number of points per cluster is usually in the hundreds or thousands.
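For concreteness, here is a minimal Python sketch of that procedure using scipy.stats.ks_2samp in place of the linked ROOT implementation; the function name, the 0.05 threshold, and the example cluster are illustrative assumptions, not part of the original question.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def cluster_close_to_normal(points, threshold=0.05):
    """Two-sample KS test of a cluster against a freshly simulated normal
    sample that uses the cluster's own mean and standard deviation."""
    points = np.asarray(points, dtype=float)
    mu = points.mean()
    sigma = points.std(ddof=1)            # sample standard deviation
    tmp = rng.normal(mu, sigma, size=points.size)
    statistic, p_value = stats.ks_2samp(points, tmp)   # sorting is handled internally
    return p_value > threshold, p_value

# Example: one cluster of a few hundred points.
cluster = rng.normal(10.0, 2.0, size=500)
print(cluster_close_to_normal(cluster))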

Problem

I have observed that the result strongly depends on the randomly generated "tmp" points. Even when I randomly generated two sets of points from the same distribution with the same parameters, the resulting probability from KS-TEST varied from just above 0.0 to above 0.99. So it is difficult for me to choose a proper "threshold" value: the same cluster can be considered "close to a normal distribution" on one run and not on another.

Can you give me advice on what I am doing wrong and how I can get more reliable results?

Best Answer

There are two standard versions of the Kolmogorov-Smirnov test:

  • The one-sample KS, which tests if a sample of points $X_1, \ldots, X_n \in \mathbb{R}$ fits a specific continuous distribution function $F$.
  • The two-sample KS, which tests whether it is reasonable to assume that two sets of samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ come from the same continuous distribution.

It seems that the code you are using only provides the two-sample version, but your problem is inherently a one-sample goodness-of-fit problem. It would be better to find an implementation of the one-sample test. This would eliminate the needless step of generating the variable 'tmp' and should increase the statistical power of the procedure.
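As an illustration (not the ROOT code referenced above), the one-sample test against the fitted normal CDF can be run with scipy.stats.kstest; the helper name below is made up, and the comments note the usual caveat about estimating the parameters from the same data.

import numpy as np
from scipy import stats

def one_sample_ks_normal(points):
    """One-sample KS test of the points against a normal distribution whose
    mean and standard deviation are estimated from the points themselves."""
    points = np.asarray(points, dtype=float)
    mu = points.mean()
    sigma = points.std(ddof=1)
    # The empirical CDF is compared directly with the fitted normal CDF,
    # so no random 'tmp' sample is needed.
    statistic, p_value = stats.kstest(points, "norm", args=(mu, sigma))
    # Caveat: because mu and sigma come from the same data, the nominal
    # p-value is only approximate (cf. the Lilliefors correction).
    return statistic, p_value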

The Kolmogorov-Smirnov test is often a poor choice, since it has little sensitivity in the tails of the distribution. I would recommend trying other tests, such as the Anderson-Darling test or the Berk-Jones test.
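As a sketch, SciPy ships an Anderson-Darling test for normality (scipy.stats.anderson), which reports a statistic and critical values rather than a p-value; the wrapper below and the 5% level are illustrative assumptions.

from scipy import stats

def looks_normal_by_anderson(points, level=5.0):
    """Anderson-Darling test for normality at the given significance level
    (in percent, one of 15, 10, 5, 2.5, 1); the normal parameters are
    estimated from the data."""
    result = stats.anderson(points, dist="norm")
    idx = list(result.significance_level).index(level)
    return result.statistic < result.critical_values[idx]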

As for the distribution of test results: this is expected. Under the null hypothesis (that the samples come from exactly the distribution you are testing against) the p-value computed for the Kolmogorov-Smirnov statistic is a Uniform[0,1] random variable.

In fact, this is true of $p$-values under the null hypothesis whenever the test statistic has a continuous distribution under that null. For more information about this fact, see: "Why are p-values uniformly distributed under the null hypothesis?"
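A quick simulation makes this concrete; it is only a sketch, using scipy's one-sample KS test against the true parameters (not parameters estimated from the data), with the sample size and repetition count chosen arbitrarily.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Draw many samples from N(0, 1) and test each one against N(0, 1) itself.
p_values = np.array([
    stats.kstest(rng.normal(0.0, 1.0, size=500), "norm").pvalue
    for _ in range(2000)
])

# Under the null hypothesis the p-values are approximately Uniform[0, 1]:
# about 5% of them fall below 0.05, and the histogram over ten equal bins
# is roughly flat.
print((p_values < 0.05).mean())
print(np.histogram(p_values, bins=10, range=(0, 1))[0])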