Solved – Should I sub-sample very large datasets to run the Kolmogorov-Smirnov (KS) test

goodness of fit, kolmogorov-smirnov test, sample-size

I have two lists with a very large number of values (one has 21,410,024,757 values and the other 10,561,427). When I run the two-sample KS test the p-value is 0, but many posts suggest that for such large datasets the KS test isn't reliable, e.g.:
Kolmogorov-Smirnov test statistic interpretation with large samples

Is it a good approach to sub-sample the two lists of values randomly and uniformly?
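To make the question concrete, here is a minimal sketch of the sub-sampling idea using `scipy.stats.ks_2samp`. The data are made up (two normal samples with a tiny 0.02 mean shift standing in for the real lists, at much smaller sizes so the script runs quickly); only the procedure is the point.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two lists: a tiny, practically
# negligible 0.02 mean shift between them.
a = rng.normal(0.00, 1.0, 2_000_000)
b = rng.normal(0.02, 1.0, 1_000_000)

# Full-data test: with millions of values, even this tiny shift
# yields a vanishingly small p-value.
stat_full, p_full = ks_2samp(a, b)

# The sub-sampling idea from the question: uniform random draws
# without replacement from each list, then the same test.
sub_a = rng.choice(a, size=1_000, replace=False)
sub_b = rng.choice(b, size=1_000, replace=False)
stat_sub, p_sub = ks_2samp(sub_a, sub_b)

print(f"full data:  D={stat_full:.4f}, p={p_full:.2e}")
print(f"sub-sample: D={stat_sub:.4f}, p={p_sub:.2e}")
```

The sub-sampled test typically comes back non-significant here, which is exactly the behavior the answer below warns about: the sub-sample does not make the test more "reliable", it just throws away power.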

Best Answer

In effect what you are asking for is a p-hack.

The KS test is not "unreliable" with large $n$. Quite the opposite: it is overly powered. What does the KS test tell you? Whether two distributions are different. Lots of data means lots of power, and you now have evidence that the first population, with a sample in the billions, is different from the second, with a sample in the millions. I don't think we should be surprised by this.
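The "lots of data means lots of power" point can be seen directly by holding a tiny, fixed difference constant and growing the sample size. This sketch uses made-up normal samples with a 0.02 mean shift; the p-value collapses as $n$ grows even though the difference never changes.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
pvals = []
# A fixed, practically negligible difference: a 0.02 mean shift on sd 1.
for n in (1_000, 100_000, 1_000_000):
    x = rng.normal(0.00, 1.0, n)
    y = rng.normal(0.02, 1.0, n)
    stat, p = ks_2samp(x, y)
    pvals.append(p)
    print(f"n={n:>9,}: D={stat:.4f}, p={p:.2e}")
```

At $n = 1{,}000$ the shift is invisible to the test; at $n = 1{,}000{,}000$ it is "highly significant". Nothing about the distributions changed, only the power.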

The problem is that you didn't think about this before running the test. If you now sub-sample, or lower the alpha level to some very small value so that the test comes out non-significant, you are choosing the threshold after already knowing the p-value. That is p-hacking, and it raises the question of why run the test at all.

Also a note about reporting p-values. A p-value is never 0. Even when the two samples are identical, as a formality you should report the p-value as p < 0.001 (or whatever number of significant figures you choose for reporting p-values); declaring something to be impossible can lead to a serious case of foot-in-mouth syndrome.

I find that tests are not useful summaries of distributional differences. It's surprising how informative graphics can be as a summary instead. Consider simply showing two overlaid estimates of the density, using a kernel smoother or similar. That way, if the "difference" driving the significant result is some small, ignorable facet, the graph tells us there is no practical difference to mind.
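A minimal sketch of that graphical summary, assuming `scipy` and `matplotlib` are available. The two samples here are invented (a negligible 0.02 mean shift); for lists as large as the question's, you would estimate each density on a manageable random sub-sample or a binned summary, which is a perfectly sensible use of sub-sampling, for description rather than for testing.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Made-up stand-ins for the two lists.
a = rng.normal(0.00, 1.0, 50_000)
b = rng.normal(0.02, 1.0, 50_000)

# Kernel density estimates evaluated on a common grid.
grid = np.linspace(-4, 4, 400)
dens_a = gaussian_kde(a)(grid)
dens_b = gaussian_kde(b)(grid)

# Overlay the two estimated densities with transparency.
fig, ax = plt.subplots()
ax.fill_between(grid, dens_a, alpha=0.4, label="list 1")
ax.fill_between(grid, dens_b, alpha=0.4, label="list 2")
ax.set_xlabel("value")
ax.set_ylabel("estimated density")
ax.legend()
fig.savefig("density_overlay.png")
```

With a difference this small the two shaded curves sit essentially on top of each other, which is the practical message a p-value of "0" completely hides.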