Solved – Kolmogorov-Smirnov (ks_2samp) p-value not as expected – Wrong test or understanding

distributions, hypothesis-testing, kolmogorov-smirnov-test, scipy

Context

I am using scipy's ks_2samp in order to apply the two-sample Kolmogorov-Smirnov test.

The data I use is twofold:

  1. I have a dataset d1 which is an evaluation metric applied to the forecasts of a machine-learning model m1 (namely the MASE – Mean Absolute Scaled Error). These are around 6,000 data points, i.e. the MASE results of 6,000 forecasts made with m1.
  2. My second dataset d2 is analogous to d1, with the difference that I used a second model m2, which differs slightly from m1.

The distributions of both datasets look like this:

[Figure: histogram of d1]

[Figure: histogram of d2]

As can be seen, the distributions look pretty much alike. I wanted to underline this fact with a Kolmogorov-Smirnov test. However, the results I get from ks_2samp indicate the contrary:

from scipy.stats import ks_2samp

ks_2samp(d1, d2)

# Ks_2sampResult(statistic=0.04779414731236298, pvalue=3.8802872942682265e-10)

As I understand it, such a p-value indicates that the distributions are not alike (rejection of H0). But judging from the images, they definitely should be.

Questions

  1. Am I misunderstanding the usage of the Kolmogorov-Smirnov test, and is it simply not applicable for this use case / kind of distribution?
  2. If the answer to the first question is yes, what alternatives do I have?

Edit

Below is the overlay graph. Concluding from your answers and comments, I assume that the divergence in the "middle" might be the cause, since the KS test is most sensitive there.

[Figure: overlay of the d1 and d2 distributions]

Best Answer

A P-value below 0.05 would indicate that the two samples are from different distributions. Your P-value is smaller than 0.05, so you would reject the null hypothesis that the two samples are from the same distribution.

A difficulty with the Kolmogorov-Smirnov test when used with large sample sizes is that small, practically unimportant differences between two samples can be flagged as 'significantly different'.

Here are two large samples of size $n = 4000$ generated from the same distribution in R:

set.seed(824)
x1 = rnorm(4000, 100, 15);  x2 = rnorm(4000, 100, 15)

A K-S test in R (correctly) does not find a difference between them:

ks.test(x1, x2)

        Two-sample Kolmogorov-Smirnov test

data:  x1 and x2
D = 0.0165, p-value = 0.6476
alternative hypothesis: two-sided

By contrast, here are two large samples from slightly different distributions, for which the K-S test (correctly) rejects the null hypothesis with a P-value smaller than 0.05.

set.seed(2019)
y1 = rnorm(4000, 99, 15);  y2 = rnorm(4000, 100, 15)
ks.test(y1,y2)

        Two-sample Kolmogorov-Smirnov test

data:  y1 and y2
D = 0.03625, p-value = 0.01043
alternative hypothesis: two-sided
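Since the question uses scipy, the same phenomenon can be sketched in Python. The sample size and mean shift below are chosen for illustration only (a larger n than in the R example, so the outcome is essentially deterministic):

```python
# Illustration with scipy (parameters chosen for this sketch, not taken
# from the original data): with very large samples, the two-sample KS
# test flags even a tiny mean shift as "significant".
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(824)
n = 100_000

# Two samples from the *same* N(100, 15) distribution.
x1 = rng.normal(100, 15, n)
x2 = rng.normal(100, 15, n)
same = ks_2samp(x1, x2)

# Two samples whose means differ by only 1 (about 1/15 of a standard deviation).
y1 = rng.normal(99, 15, n)
y2 = rng.normal(100, 15, n)
shifted = ks_2samp(y1, y2)

print(same.statistic, same.pvalue)
print(shifted.statistic, shifted.pvalue)
```

With n this large, the shifted pair is rejected at any conventional level even though the two distributions are nearly identical in practical terms, which mirrors what is happening with the ~6,000 MASE values.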

Addendum, per comment: viewed separately, the two ECDF plots look about the same; in an overlay, the slight difference becomes visible.

par(mfrow=c(1,3))
 plot(ecdf(y1), col="blue"); plot(ecdf(y2), col="orange")
 plot(ecdf(y1), col="blue", main="Overlay")
  lines(ecdf(y2), col="orange")
par(mfrow=c(1,1))
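The overlay view also shows exactly what the KS statistic measures: the largest vertical gap between the two ECDFs. A minimal numpy sketch (illustrative parameters, not the original data) computes that gap by hand and checks it against scipy:

```python
# Illustrative sketch: the two-sample KS statistic D is the largest
# vertical distance between the two empirical CDFs, i.e. the gap one
# looks for in an overlay plot.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2019)
y1 = rng.normal(99, 15, 4000)
y2 = rng.normal(100, 15, 4000)

# Evaluate both ECDFs on the pooled sample points.
pooled = np.sort(np.concatenate([y1, y2]))
ecdf1 = np.searchsorted(np.sort(y1), pooled, side="right") / len(y1)
ecdf2 = np.searchsorted(np.sort(y2), pooled, side="right") / len(y2)
d_manual = np.max(np.abs(ecdf1 - ecdf2))

# Should match scipy's statistic.
d_scipy = ks_2samp(y1, y2).statistic
print(d_manual, d_scipy)
```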

[Figure: ECDF plots of y1 and y2, shown separately and overlaid]

In this specific example, the difference is in the population means. Because the data are normal, a two-sample t test 'finds' this difference:

t.test(y1, y2)

        Welch Two Sample t-test

data:  y1 and y2
t = -3.4689, df = 7997.9, p-value = 0.0005253
alternative hypothesis: 
  true difference in means is not equal to 0
95 percent confidence interval:
 -1.8404842 -0.5114389
sample estimates:
mean of x mean of y 
 98.39611  99.57208
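For completeness, the Welch test is also available in scipy via ttest_ind with equal_var=False. The parameters below are illustrative (a larger n than in the R example, so the detection is essentially certain):

```python
# Welch two-sample t-test in scipy (illustrative parameters, not the
# original data): equal_var=False requests the Welch variant, which
# does not assume equal variances in the two groups.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2019)
n = 100_000
y1 = rng.normal(99, 15, n)   # mean 99
y2 = rng.normal(100, 15, n)  # mean 100

res = ttest_ind(y1, y2, equal_var=False)
print(res.statistic, res.pvalue)
```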