Does the 2-sample KS test work? If so, why is it so unintuitive?

Tags: distributions, kolmogorov-smirnov test, statistical significance

Here is an example. I have two data sets as shown below as histograms:

[Figure: histograms of the two data sets]

I want to examine whether these data sets are drawn from the same underlying distribution. To do this, I am using the 2-sample Kolmogorov-Smirnov test. This test works by computing the empirical cumulative distribution function (ECDF) for each data set and then measuring the maximum distance between the two ECDFs. Here are the ECDFs:

[Figure: empirical CDFs of the two data sets]
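In code, the test amounts to the following. This is a minimal sketch with made-up stand-in data (`sample1` and `sample2`), since the actual data sets are not included here; it just shows the mechanics of building both ECDFs, taking the largest pointwise gap, and letting `scipy.stats.ks_2samp` turn that gap into a p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the two data sets (the real data are not shown here).
rng = np.random.default_rng(42)
sample1 = rng.normal(loc=0.0, scale=1.0, size=5000)
sample2 = rng.normal(loc=0.0, scale=1.0, size=5000)

# Manual KS statistic: the largest gap between the two ECDFs.
# The supremum is always attained at one of the observed data points.
s1, s2 = np.sort(sample1), np.sort(sample2)
grid = np.concatenate([s1, s2])
ecdf1 = np.searchsorted(s1, grid, side="right") / len(s1)
ecdf2 = np.searchsorted(s2, grid, side="right") / len(s2)
D_manual = np.max(np.abs(ecdf1 - ecdf2))

# SciPy computes the same statistic and also returns a p-value.
D, p = stats.ks_2samp(sample1, sample2)
print(f"manual D = {D_manual:.4f}, scipy D = {D:.4f}, p = {p:.4g}")  # the two D values should agree
```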

Looking at it visually, this looks like a no-brainer: these are from the same distribution! The histograms look very similar and the maximum difference between the two ECDFs is tiny. But, to my surprise, the KS test rejects the null hypothesis! The p-value is very small (p = 0.0011), suggesting that the two data sets were in fact drawn from different distributions.

What's going on here? Am I missing something? Is the KS test the wrong test to use?

Any help is appreciated.

Best Answer

One reason for a formal test is to save us from having to eyeball these things. Maybe they look the same, but intuition can be deceiving. How close is close? I don't necessarily trust myself to judge that. $N(0, 1)$ and $N(0, 1.0001)$ are not the same, but you'd have a hard time telling just by looking at ECDFs of draws from either. And looking at the plots you provide, those look pretty different to me. You should compare your two sets of draws using probability plots, and see if you still think it's a no-brainer.
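To make that comparison concrete, here is one way to draw a two-sample probability (QQ) plot in Python. This is an illustrative sketch, not code from the question or the answer; `x` and `y` are hypothetical stand-ins for the two data sets.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the two observed samples.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=5000)
y = rng.normal(0, 1, size=5000)

# Two-sample QQ plot: plot matching quantiles of one sample against the other.
probs = np.linspace(0.01, 0.99, 99)
qx = np.quantile(x, probs)
qy = np.quantile(y, probs)

plt.scatter(qx, qy, s=10)
lims = [min(qx.min(), qy.min()), max(qx.max(), qy.max())]
plt.plot(lims, lims, "k--", label="y = x (identical distributions)")
plt.xlabel("Quantiles of sample 1")
plt.ylabel("Quantiles of sample 2")
plt.legend()
plt.show()
```

If the points hug the reference line, the two samples agree quantile by quantile; a systematic tilt or curve is exactly the kind of difference that is easy to miss in overlaid ECDFs.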

Anyway, the KS test operates like any other hypothesis test. In a finite sample, we expect the two ECDFs to differ somewhat due to sampling error, even if the underlying population CDFs are the same. You compute the biggest pointwise difference between the ECDFs, and the KS test asks "for this sample size, how likely is it that I would get ECDFs that differ by this much or more if the underlying population CDFs are in fact the same?" If you have a large enough sample, even a seemingly small difference in the ECDFs could be considered "too big." It's basically a statement that "if you have this many draws, and there is still a perceptible difference in the ECDFs, it is likely that the population CDFs differ."
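A quick simulation makes the sample-size effect visible. This is an illustrative sketch with my own numbers, not data from the question: the two populations below differ only slightly in spread, $N(0, 1)$ versus $N(0, 1.05)$ (1.05 rather than 1.0001 so the effect shows up at feasible sample sizes). The ECDF gap stays small throughout, but the p-value collapses once the sample is large enough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two populations that differ only slightly in spread.
for n in [100, 1_000, 10_000, 100_000]:
    a = rng.normal(0, 1.00, size=n)
    b = rng.normal(0, 1.05, size=n)
    D, p = stats.ks_2samp(a, b)
    print(f"n = {n:>7d}   D = {D:.4f}   p = {p:.4g}")
```

With a few hundred draws per group the test cannot see a 5% difference in spread; with a hundred thousand it rejects decisively, even though the maximum ECDF gap is only on the order of 0.01.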

As was pointed out here, in a big enough sample the KS test is going to alert you to the difference between $N(0, 1)$ and $N(0, 1.0001)$, but do you really care? Depends on the application.