Solved – Should I sub-sample very large datasets to run the Kolmogorov-Smirnov (KS) test

goodness of fit, kolmogorov-smirnov test, sample-size

I have two lists with a very large number of values (one has 21,410,024,757 values and the other 10,561,427). When I run the two-sample KS test the p-value is 0, but many posts suggest that for such large datasets the KS test isn't reliable, e.g.:
Kolmogorov-Smirnov test statistic interpretation with large samples

Is it a good approach to sub-sample the two lists of values randomly and uniformly?
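To make the question concrete, here is a minimal sketch of the sub-sampling idea using `scipy.stats.ks_2samp`. The data are made up (two normal samples with a tiny 0.02 mean shift standing in for the real lists, at much smaller sizes so the script runs quickly); only the procedure is the point.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two lists: a tiny, practically
# negligible 0.02 mean shift between them.
a = rng.normal(0.00, 1.0, 2_000_000)
b = rng.normal(0.02, 1.0, 1_000_000)

# Full-data test: with millions of values, even this tiny shift
# yields a vanishingly small p-value.
stat_full, p_full = ks_2samp(a, b)

# The sub-sampling idea from the question: uniform random draws
# without replacement from each list, then the same test.
sub_a = rng.choice(a, size=1_000, replace=False)
sub_b = rng.choice(b, size=1_000, replace=False)
stat_sub, p_sub = ks_2samp(sub_a, sub_b)

print(f"full data:  D={stat_full:.4f}, p={p_full:.2e}")
print(f"sub-sample: D={stat_sub:.4f}, p={p_sub:.2e}")
```

The sub-sampled test typically comes back non-significant here, which is exactly the behavior the answer below warns about: the sub-sample does not make the test more "reliable", it just throws away power.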

Best Answer

In effect what you are asking for is a p-hack.

The KS test is not "unreliable" with large $n$. Quite the opposite: it is overly powered. What does the KS test tell you? Whether two distributions are different. Lots of data means lots of power, and you now have evidence that the first population, with a sample in the billions, is different from the second, with a sample in the millions. I don't think we should be surprised by this.
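The "lots of data means lots of power" point can be seen directly by holding a tiny, fixed difference constant and growing the sample size. This sketch uses made-up normal samples with a 0.02 mean shift; the p-value collapses as $n$ grows even though the difference never changes.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
pvals = []
# A fixed, practically negligible difference: a 0.02 mean shift on sd 1.
for n in (1_000, 100_000, 1_000_000):
    x = rng.normal(0.00, 1.0, n)
    y = rng.normal(0.02, 1.0, n)
    stat, p = ks_2samp(x, y)
    pvals.append(p)
    print(f"n={n:>9,}: D={stat:.4f}, p={p:.2e}")
```

At $n = 1{,}000$ the shift is invisible to the test; at $n = 1{,}000{,}000$ it is "highly significant". Nothing about the distributions changed, only the power.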

The problem is that you didn't think about this before running the test. If you now sub-sample, or lower the alpha level to some very small value so that the test comes out non-significant, you are choosing the threshold after already knowing the p-value. That is p-hacking, and it raises the question of why run the test at all.

Also a note about reporting p-values. A p-value is never 0. Even when the two samples are identical, as a formality you should report the p-value as p < 0.001 (or whatever number of significant figures you choose for reporting p-values); declaring something to be impossible can lead to a serious case of foot-in-mouth syndrome.

I find that tests are not useful summaries of distributional differences. It's surprising how informative graphics can be as a summary instead. Consider simply showing two overlaid estimates of the density, using a kernel smoother or similar. That way, if the "difference" driving the significant result is some small, ignorable facet, the graph tells us there is no practical difference to mind.
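A minimal sketch of that graphical summary, assuming `scipy` and `matplotlib` are available. The two samples here are invented (a negligible 0.02 mean shift); for lists as large as the question's, you would estimate each density on a manageable random sub-sample or a binned summary, which is a perfectly sensible use of sub-sampling, for description rather than for testing.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Made-up stand-ins for the two lists.
a = rng.normal(0.00, 1.0, 50_000)
b = rng.normal(0.02, 1.0, 50_000)

# Kernel density estimates evaluated on a common grid.
grid = np.linspace(-4, 4, 400)
dens_a = gaussian_kde(a)(grid)
dens_b = gaussian_kde(b)(grid)

# Overlay the two estimated densities with transparency.
fig, ax = plt.subplots()
ax.fill_between(grid, dens_a, alpha=0.4, label="list 1")
ax.fill_between(grid, dens_b, alpha=0.4, label="list 2")
ax.set_xlabel("value")
ax.set_ylabel("estimated density")
ax.legend()
fig.savefig("density_overlay.png")
```

With a difference this small the two shaded curves sit essentially on top of each other, which is the practical message a p-value of "0" completely hides.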