Solved – Understanding Kolmogorov-Smirnov test in R

kolmogorov-smirnov test, r, ties

I'm trying to understand the output of the Kolmogorov-Smirnov test function (two-sample, two-sided).
Here is a simple test.

x <- c(1,2,2,3,3,3,3,4,5,6)
y <- c(2,3,4,5,5,6,6,6,6,7)
z <- c(12,13,14,15,15,16,16,16,16,17)

ks.test(x,y)

#   Two-sample Kolmogorov-Smirnov test
#
#data:  x and y
#D = 0.5, p-value = 0.1641
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(x, y) : cannot compute exact p-value with ties

ks.test(x,z)

#   Two-sample Kolmogorov-Smirnov test
#
#data:  x and z
#D = 1, p-value = 9.08e-05
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(x, z) : cannot compute exact p-value with ties


ks.test(x,x)

#   Two-sample Kolmogorov-Smirnov test
#
#data:  x and x
#D = 0, p-value = 1
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(x, x) : cannot compute exact p-value with ties

There are a few things I don't understand here.

  1. From the help, it seems that the p-value refers to the hypothesis var1 = var2. However, here that would mean that (at the p < 0.05 level) the test says:

    a. Cannot say that X = Y;

    b. Can say that X = Z;

    c. Cannot say that X = X (!)

Besides the fact that x appears to be different from itself (!), it also seems quite strange to me that the test supports x = z, as the two distributions have zero overlapping support. How is that possible?

  2. According to the definition of the test, D should be the maximum difference between the two probability distributions, but for instance in the case (x, y) it should be D = Max|P(x)-P(y)| = 4 (in the case when P(x), P(y) aren't normalized) or D = 0.3 (if they are normalized). Why is D different from that? (An ECDF computation is sketched right after this list.)

  3. I have intentionally made an example with many ties, as the data I'm working with contain lots of identical values. Why does this confuse the test? I thought it calculated a probability distribution, which should not be affected by repeated values. Any ideas?
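
For point 2, the D that ks.test reports comes from the empirical CDFs rather than from raw frequencies; a minimal sketch using base R's ecdf() (same x and y as above) shows where the 0.5 comes from:

# same x and y as above
x <- c(1,2,2,3,3,3,3,4,5,6)
y <- c(2,3,4,5,5,6,6,6,6,7)

Fx <- ecdf(x)
Fy <- ecdf(y)

# D is the largest vertical gap between the two empirical CDFs,
# evaluated at the pooled data values
grid <- sort(unique(c(x, y)))
max(abs(Fx(grid) - Fy(grid)))
# 0.5, matching the D reported by ks.test(x, y)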

Best Answer

The KS test is premised on testing the "sameness" of two independent samples from a continuous distribution (as the help page states). If that is the case, then the probability of ties should be astonishingly small (also stated). The test statistic is the maximum distance between the ECDFs of the two samples. The p-value is the probability of seeing a test statistic as high or higher than the one observed if the two samples were drawn from the same distribution. (It is not the "probability that var1 = var2". And furthermore, 1 - p-value is NOT that probability either.)

High p-values say you cannot claim statistical support for a difference, but low p-values are not evidence of sameness either; a low p-value is evidence of some difference, which is exactly what you see for x versus z. Low p-values can occur with low sample sizes (as your example provides) or in the presence of interesting but small differences, e.g. superimposed oscillatory disturbances. If you are working with situations with large numbers of ties, it suggests you may need to use a test that more closely fits your data situation.
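
To make that definition of the p-value concrete, here is a minimal permutation sketch (ks.test itself computes its p-value from exact or asymptotic formulas, hence the ties warning; this is just an illustration): it re-splits the pooled data at random many times and counts how often a D at least as large as the observed one appears. Because it conditions on the observed values, ties pose no special problem for it.

# same x and y as above
x <- c(1,2,2,3,3,3,3,4,5,6)
y <- c(2,3,4,5,5,6,6,6,6,7)

set.seed(1)
pooled <- c(x, y)
D_obs  <- suppressWarnings(ks.test(x, y)$statistic)
D_perm <- replicate(10000, {
  idx <- sample(length(pooled), length(x))   # random re-split of the pooled values
  suppressWarnings(ks.test(pooled[idx], pooled[-idx])$statistic)
})
mean(D_perm >= D_obs)   # Monte-Carlo p-value, unaffected by the ties warning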

My explanation of why ties were a violation of assumptions was not a claim that ties invalidated the results. The statistical properties of the KS test are, in practice, relatively resistant or robust to failure of that assumption. The main problem with the KS test, as I see it, is that it is excessively general and, as a consequence, under-powered to identify meaningful differences of an interesting nature: being sensitive to every kind of departure, it has rather low power against more specific alternative hypotheses.
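
For a feel of what "under-powered" means in practice, here is a small simulation sketch (sample size, shift, and replication count chosen arbitrarily) comparing how often ks.test and t.test detect a pure mean shift between two normal samples:

set.seed(42)
n_sim <- 2000
n     <- 20      # per-sample size
shift <- 0.5     # true difference in means

reject <- replicate(n_sim, {
  a <- rnorm(n)
  b <- rnorm(n, mean = shift)
  c(ks = ks.test(a, b)$p.value < 0.05,   # general-purpose test
    t  = t.test(a, b)$p.value  < 0.05)   # test targeted at a mean shift
})
rowMeans(reject)   # proportion of runs in which each test detects the shift

With settings like these the t-test typically rejects more often, which is the sense in which the KS test trades power for generality.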

On the other hand, I also see the KS test (or the "even more powerful" Anderson-Darling or Lilliefors test) used to test "normality" in situations where such a test is completely unwarranted, such as testing the normality of variables used as predictors in a regression model before the fit. One might legitimately want to test the normality of the residuals, since that is what is assumed in the modeling theory. Even then, modest departures from normality of the residuals do not generally challenge the validity of the results. People would be better off using robust methods to check for any important impact of "non-normality" on conclusions about statistical significance.
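
If the goal really is to look at residual normality, a minimal sketch along those lines (with simulated data; shapiro.test and a Q-Q plot are used here as one common choice, not a prescription):

set.seed(7)
d   <- data.frame(x = runif(50))
d$y <- 2 + 3 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

# inspect the residuals, not the raw predictor
qqnorm(resid(fit)); qqline(resid(fit))
shapiro.test(resid(fit))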

Perhaps you should consult a local statistician? That might help you define the statistical question a bit more precisely and therefore give you a better chance of identifying a difference if one actually exists. That would help avoid a "Type II error": failing to support a conclusion of difference when such a difference is present.