R – Two-Sample Kolmogorov-Smirnov Test P-Value Confusion


I'm confused about the appropriate interpretation of p-values returned by the two-sample Kolmogorov-Smirnov test (ks.test) in R.

In slide 23 of this presentation about non-parametric two-sample tests, the author states that when analyzing the ks.test results:

ks.test(male, female)
Two-sample Kolmogorov-Smirnov test
data: male and female 
D = 0.8333, p-value = 0.02597

the p-value "needs to be multiplied by 2 for a 2-tail test. Thus, P = 0.05194".

Is that true?

If we used the original p = 0.02597, we would reject the hypothesis that the distributions are the same, because p < 0.05, correct? Whereas if we multiply it by 2, the p-value would suggest that there is no detectable difference between the distributions, since p > 0.05?

What am I missing?

Best Answer

No, it's wrong. The default Kolmogorov-Smirnov test in R is already two-sided, i.e. it already tests $F_X\neq F_Y$ rather than $F_X<F_Y$ or $F_X>F_Y$ (in all three cases, we should add "somewhere").

If you had done a one-tailed test but intended to do a two-tailed test (and if the sample turned out to have a difference in the direction you tested for), doubling the p-value is usually reasonably close to the correct two-tailed value, but strictly speaking it is still wrong.

While in the case of the t-test the events of rejecting in each tail are mutually exclusive - so you can just add their probabilities, and symmetric so adding is doubling - for the Kolmogorov-Smirnov they're not mutually exclusive -- each of the one-tailed Kolmogorov-Smirnov tests can reject on the same sample. However, under the null it's relatively rare to be able to reject both directions and so it's generally not a bad approximation to double.
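To spell out that argument, here is a short sketch. The notation is introduced only for illustration and is not from the original answer: $\hat F_X$ and $\hat F_Y$ are the empirical CDFs, $D^+$ and $D^-$ are the two one-sided statistics, and $d$ is the observed value of the two-sided statistic.

$$D^+ = \sup_t \left(\hat F_X(t) - \hat F_Y(t)\right), \qquad D^- = \sup_t \left(\hat F_Y(t) - \hat F_X(t)\right), \qquad D = \max(D^+, D^-)$$

$$P(D \ge d) = P(D^+ \ge d) + P(D^- \ge d) - P(D^+ \ge d,\ D^- \ge d) \le P(D^+ \ge d) + P(D^- \ge d)$$

When the two one-sided statistics have the same null distribution, the right-hand side is twice the one-sided p-value. Doubling therefore overstates the two-sided p-value by exactly the overlap term $P(D^+ \ge d,\ D^- \ge d)$, which is small when $d$ is large enough to matter.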

It's just unnecessary, since the ks.test function will happily calculate two-tailed p-values without our doing anything extra -- in fact, we have to explicitly ask for a one-tailed one.
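A minimal sketch of this, using made-up numbers (the vectors below are hypothetical and are not the data from the slide). It checks that calling ks.test without an alternative argument gives the same p-value as asking for "two.sided" explicitly, and shows how the one-sided versions have to be requested by name.

```r
# Hypothetical measurements, invented for illustration only
male   <- c(19.1, 22.4, 24.8, 26.3, 27.9, 30.2)
female <- c(12.5, 14.7, 16.0, 17.3, 18.8, 23.1)

default_test  <- ks.test(male, female)                             # no alternative given
explicit_test <- ks.test(male, female, alternative = "two.sided")

# One-sided tests must be requested explicitly
# "less":    alternative is that the CDF of x lies below that of y
# "greater": alternative is that the CDF of x lies above that of y
less_test    <- ks.test(male, female, alternative = "less")
greater_test <- ks.test(male, female, alternative = "greater")

default_test$alternative   # "two-sided"

c(default   = default_test$p.value,
  two_sided = explicit_test$p.value,
  less      = less_test$p.value,
  greater   = greater_test$p.value)
```

The first two entries of the printed vector are identical, so there is nothing left to double.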

Related Solutions

Solved – Two sample Kolmogorov-Smirnov test and p-value interpretation

If you are using the traditional 0.05 alpha level cutoff, then all but group 3 are significantly different from your full group. It is a little easier to see this if the p-values are not in scientific notation (you can use options(scipen=5) in R to make this less likely). Also, group 1 becomes non-significant under some adjustments for multiple tests. You should consider whether that adjustment applies in your case or not. Also note that the groups that are not significant could still be different; the test may simply have low power.
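As a rough illustration of both points, the sketch below uses invented p-values (the asker's actual values are not reproduced here) to show printing without scientific notation and two common multiple-testing adjustments via p.adjust.

```r
# Hypothetical p-values for five group-vs-full comparisons (made up for illustration)
p_raw <- c(group1 = 0.012, group2 = 3.1e-06, group3 = 0.41,
           group4 = 8.4e-04, group5 = 2.2e-05)

# Print without scientific notation
format(p_raw, scientific = FALSE)

# Two common adjustments for multiple tests
p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

round(cbind(raw = p_raw, holm = p_holm, bonferroni = p_bonf), 6)
```

With these made-up numbers, group 1 stays below 0.05 under the Holm adjustment (0.024) but not under Bonferroni (0.06), which is the kind of method-dependent result described above.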

But that just means that any differences, however small, are not easily explained by chance. It could be that your groups are close enough for practical purposes. It is usually more meaningful to plot the data to see how different the distributions are. You could use the qqplot function as one approach. The vis.test function in the TeachingDemos package for R gives another approach.
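A small sketch of the plotting idea, using simulated stand-ins for two groups (the data and the parameter values are assumptions made only so the example runs). It draws a Q-Q plot with qqplot and then overlays the two empirical CDFs, since the KS statistic is the largest vertical gap between them.

```r
set.seed(42)
# Simulated stand-ins for two groups (assumed data, not from the question)
x <- rnorm(200, mean = 0,   sd = 1)
y <- rnorm(150, mean = 0.2, sd = 1.3)

# Q-Q plot: points close to the dashed line mean similar quantiles
qq <- qqplot(x, y, xlab = "Quantiles of x", ylab = "Quantiles of y",
             main = "Q-Q plot of two groups")
abline(0, 1, lty = 2)

# Empirical CDFs on one plot: the KS distance is their largest vertical gap
plot(ecdf(x), main = "Empirical CDFs", xlab = "Value",
     ylab = "Cumulative proportion")
plot(ecdf(y), add = TRUE, col = "red")
```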

One possible hitch is if your groups are part of the "Full" data set as well, then you don't have the independence assumed (but given the sample sizes, I am not sure how much this would affect things). You could address this by taking random samples from the full data set and computing the KS-distance for each (ignore the p-value), then compare where your actual data falls relative to the random samples.
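The resampling idea could be sketched as follows. Everything here is a hypothetical setup: the simulated `full` vector, the choice of the first 150 observations as the group, the helper name `ks_distance`, and the 500 replications are all assumptions for illustration.

```r
set.seed(123)
# Hypothetical setup: the group is the first 150 observations of the full data set
full      <- c(rnorm(150, mean = 0.3), rnorm(1850))
group_idx <- 1:150
group     <- full[group_idx]

# KS distance: largest vertical gap between the two empirical CDFs,
# evaluated at every observed value (where the step functions jump)
ks_distance <- function(a, b) {
  grid <- c(a, b)
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

observed <- ks_distance(group, full)

# Reference distribution: random subsets of the same size drawn from the full data
reference <- replicate(500, ks_distance(sample(full, length(group)), full))

# Share of random subsets at least as far from the full data as the actual group
mean(reference >= observed)
```

Because the random subsets overlap with the full data in the same way the real group does, the comparison does not rely on the independence assumption behind the usual p-value.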

Most of this comes down to what question you really want answered: many of the exact distributional tests answer a different question than the one the researcher is really interested in.

Kolmogorov-Smirnov Test – How to Perform a Kolmogorov-Smirnov Two-Sample Test

I am assuming you are asking because the Suanshu help page reports, in reference to the K-S distribution, "This is not done yet." Luckily, it is very easy to do in R. If x and y are your two samples, ks.test(x,y) returns the test statistic and p-value. For example,

> x <- rnorm(50)
> y <- runif(30)
> ks.test(x, y)    
        Two-sample Kolmogorov-Smirnov test    
data:  x and y 
D = 0.5, p-value = 9.065e-05
alternative hypothesis: two-sided

By default, it will compute exact or asymptotic p-values depending on the product of the sample sizes (exact p-values for n.x*n.y < 10000 in the two-sample case), or you can specify this option with a third argument, exact=FALSE or exact=TRUE. In the one-sample two-sided case, exact p-values are calculated using the methods of Marsaglia, et al. (2003), which the Suanshu documentation also cites. Some large sample approximations are given here, although I don't have a proper citation. Lastly, if you don't want to install R, there are web calculators for the two-sample K-S test, although I don't know if they use the same algorithm as R because the one I found only reported three decimal places for the p-value.
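A short sketch of the exact argument, using simulated samples of the same sizes as in the example above (the seed is arbitrary, so the numbers will not match that output). It compares the default choice with explicitly requested exact and asymptotic p-values.

```r
set.seed(1)
x <- rnorm(50)
y <- runif(30)

auto_choice <- ks.test(x, y)                  # 50 * 30 = 1500 < 10000, no ties
exact_p     <- ks.test(x, y, exact = TRUE)    # force the exact p-value
asymptotic  <- ks.test(x, y, exact = FALSE)   # force the large-sample approximation

# The statistic D is the same in all three; only the p-value calculation differs
c(auto       = auto_choice$p.value,
  exact      = exact_p$p.value,
  asymptotic = asymptotic$p.value)
```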
