Solved – Why does the ks.test function give a p-value of 0 for these two samples

Tags: kolmogorov-smirnov-test, r

I have two numeric data vectors: NA_Sales and EU_Sales, which are really similar.
When I plot their ECDFs like this:

plot(ecdf(dataset$NA_Sales))
lines(ecdf(dataset$EU_Sales), col="red")

Then I get this diagram:

[Figure: ECDF of NA_Sales (black) with the ECDF of EU_Sales (red) overlaid]

This shows that the two ECDFs look very similar. But when I run the Kolmogorov-Smirnov test like this:

ks.test(dataset$NA_Sales,dataset$EU_Sales)

or

ks.test(dataset$NA_Sales,dataset$EU_Sales)$p

Then I get a p-value equal to zero. Why? The diagram shows that the distributions are similar.
Isn't ks.test() supposed to detect this by comparing their ECDFs? I would expect a p-value higher than 0.05.

Best Answer

Let us show both the issue that whuber raised (that the difference is large, not small) and the one I raised (that very small differences will be significant) in one figure -- by the simple expedient of using a monotonic transformation of the data to tame the long right tail.

In this case, the transformation $g(x) = x/(1+x)$ will pull all of the data into $[0,1)$, while leaving the Kolmogorov-Smirnov distance unchanged - as any transformation that is strictly monotonic in the range of the data would do.
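To see why the KS distance is unchanged, here is a small sketch with simulated long-tailed data (not the original sales figures): applying the same strictly increasing transformation to both samples permutes nothing in the ordering, so the two ECDFs shift together and the maximum vertical gap is identical.

```r
# Sketch: the two-sample KS statistic is invariant under a strictly
# increasing transformation applied to both samples.
# (Simulated data, not the NA_Sales / EU_Sales vectors.)
set.seed(1)
x <- rexp(1000, rate = 1)      # long right tail, loosely like sales data
y <- rexp(1000, rate = 1.2)

g <- function(v) v / (1 + v)   # strictly increasing, maps [0, Inf) into [0, 1)

d_raw <- unname(ks.test(x, y)$statistic)
d_tr  <- unname(ks.test(g(x), g(y))$statistic)

all.equal(d_raw, d_tr)  # TRUE: the KS distance is unchanged
```

Any strictly monotone transformation would do; $x/(1+x)$ is just a convenient one for taming a heavy right tail.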

[Figure: transformed ECDFs for both data sets, with the maximum KS distance (0.21358) marked]

With 16598 data points, this size of difference is far into the extreme upper tail. The tiny 5% critical distance marked on the plot is that under the assumptions of the test (so it's for continuous data).
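For a rough sense of scale, the large-sample 5% critical distance can be computed from the usual asymptotic approximation, $c(\alpha)\sqrt{(n+m)/(nm)}$ with $c(0.05)\approx 1.358$; the sketch below assumes both samples have $n = 16598$ points:

```r
# Approximate 5% critical distance for the two-sample KS test
# (large-sample approximation, assuming n = m = 16598).
n <- 16598
m <- 16598
c_alpha <- sqrt(-log(0.05 / 2) / 2)        # about 1.358 for alpha = 0.05
d_crit  <- c_alpha * sqrt((n + m) / (n * m))
round(d_crit, 4)                            # about 0.0149
```

An observed distance of 0.21358 is more than fourteen times this critical distance, which is why the reported p-value underflows to zero.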

The size of the critical distance grows very slowly in $1/\alpha$; if I have calculated it correctly, just doubling the size of that small critical difference would correspond to a significance level of less than $10^{-5}$ for continuous data.
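That calculation can be checked with the asymptotic tail of the Kolmogorov distribution, $P(D > d) \approx 2 e^{-2\lambda^2}$ where $\lambda$ is the scaled distance:

```r
# Doubling the 5% critical distance: approximate p-value from the
# asymptotic Kolmogorov tail, 2 * exp(-2 * lambda^2).
lambda   <- 2 * sqrt(-log(0.05 / 2) / 2)   # twice the 5% critical value
p_approx <- 2 * exp(-2 * lambda^2)
p_approx < 1e-5                             # TRUE
```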

However, because these data are discrete, that's highly conservative (i.e. actually corresponds to a much lower significance level); because of that any calculated p-values are considerably too large.
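The conservatism with discrete data is easy to demonstrate by simulation; this sketch uses Poisson draws (illustrative only, not the sales data) so that both samples come from the same heavily tied distribution:

```r
# With discrete, heavily tied data, two-sample KS p-values under the
# null are pushed upward: the test rejects far less often than the
# nominal 5% rate. (Simulated data, not the original variables.)
set.seed(42)
pvals <- replicate(2000, {
  x <- rpois(200, lambda = 2)   # discrete, many ties
  y <- rpois(200, lambda = 2)   # same distribution as x
  suppressWarnings(ks.test(x, y)$p.value)
})
mean(pvals < 0.05)              # well below the nominal 0.05
```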

Note also that these data are paired; that also violates the independence assumption of the test (indeed a plot shows they're substantially correlated). Because of that and the discreteness issue, you really can't put any weight on the actual numeric p-value given by the test -- it's not a meaningful number here.

Further, these are data over time in a changing market; the values within each variable represent a mixture of different distributions, and they are also not going to be independent of each other. The value of applying a formal test in this way is highly dubious.
