Solved – Reproducibility of the two-sample Kolmogorov–Smirnov test

hypothesis testing, kolmogorov-smirnov test

I am using the two-sample Kolmogorov–Smirnov test to check whether two datasets have the same underlying distribution. I can regenerate those two datasets as many times as I want. When I apply the test, however, the computed p-values differ quite dramatically from each other. I am wondering what the reason could be and how one might address it.

To give an example, here are the p-values that the test yields (I am using MATLAB’s kstest2) when I apply it to 10 pairs of datasets with 10000 samples each (a significance level of 0.05 is assumed):

8.93e-02
1.10e-01
2.41e-05 (reject)
8.52e-01
3.78e-03 (reject)
2.22e-01
2.86e-04 (reject)
3.85e-04 (reject)
1.36e-02 (reject)
9.02e-03 (reject)

As you can see, the results are quite inconsistent: almost a 50/50 split between rejections and non-rejections. I do not know how to interpret these results and would be grateful for any help. In particular, I am interested to know

  1. if there is something that I can change in my experimental setup to make the test more consistent/reproducible and

  2. if there is a sensible way to summarize multiple p-values into a single one, for example by averaging them.

What is also interesting is that drawing only 1000 samples makes many tests pass:

2.35e-01
9.88e-01
5.29e-01
3.94e-01
1.60e-01
1.76e-01
2.63e-03 (reject)
2.82e-01
1.94e-01
2.14e-01

The dependence on the sample size worries me: it seems that I can make my test pass simply by reducing the number of samples, which sounds like cheating.

Thank you!

EDIT (more context):

As requested, I would like to provide additional details about what I am actually doing. I have two deterministic algorithms, A and B, whose inputs are random with known distributions. A is the ground truth, and B is an approximation to A. B is supposed to produce outputs with the same distribution as A's. I would like to measure how well B approximates A, and my original idea was to apply the Kolmogorov–Smirnov test and monitor the corresponding p-values.
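
In case it helps, here is roughly what the experiment looks like. run_A and run_B below are placeholders made up for this post (my actual algorithms are more involved and not shown); everything else is plain kstest2 usage.

```matlab
% Placeholder algorithms: stand-ins for my deterministic algorithms applied
% to random inputs. The small 0.02 shift in run_B is purely illustrative.
run_A = @(n) randn(n, 1);            % "ground truth" outputs
run_B = @(n) randn(n, 1) + 0.02;     % "approximation" to A

nSamples = 10000;                    % samples per dataset
nRepeats = 10;                       % independent repetitions
pValues  = zeros(nRepeats, 1);

for k = 1:nRepeats
    x = run_A(nSamples);             % regenerate dataset from A
    y = run_B(nSamples);             % regenerate dataset from B
    [~, pValues(k)] = kstest2(x, y); % two-sample KS test (alpha = 0.05 by default)
end

disp(pValues)
```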

Regards,
Ivan

Best Answer

p-values are random quantities; they're a function of your (random) samples, so naturally they vary from sample to sample.

[Indeed, for a point null hypothesis and a continuous (rather than discrete) test statistic, if $H_0$ were true, the p-values generated in this manner would be uniform on $(0,1)$.]

As the situation moves further and further from the null (in the sense measured by the test statistic), the distribution of p-values shifts toward the low end. The p-value is still random, with a long tail toward larger values, but it becomes stochastically smaller.
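
Both of those claims are easy to check by simulation. Here is a small sketch (not your setup; standard normal samples are used, and the 0.05 mean shift in the second case is an arbitrary choice just to put things a little way from the null):

```matlab
rng(1);                     % reproducibility of this sketch only
nSim = 1000;                % simulated repetitions of the test
n    = 10000;               % samples per dataset, as in the question

pNull = zeros(nSim, 1);     % H0 true: both samples from N(0,1)
pAlt  = zeros(nSim, 1);     % H0 false: second sample shifted by 0.05

for k = 1:nSim
    [~, pNull(k)] = kstest2(randn(n, 1), randn(n, 1));
    [~, pAlt(k)]  = kstest2(randn(n, 1), randn(n, 1) + 0.05);
end

% Under H0 the p-values are roughly uniform on (0,1); under this alternative
% they are shifted toward zero but still vary a great deal from run to run.
fprintf('Null:        median p = %.3f, rejection rate = %.3f\n', ...
        median(pNull), mean(pNull < 0.05));
fprintf('Alternative: median p = %.3f, rejection rate = %.3f\n', ...
        median(pAlt),  mean(pAlt < 0.05));
```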

So you're simply expecting something that won't happen. The p-values will never be particularly consistent in value; they don't tend to concentrate around some underlying "population" value.

What you're seeing is how hypothesis tests work.

Your results, considered all together, already indicate the null is false (point nulls are rarely exactly true, so this is not much of a surprise). If you can take larger samples, your typical p-values will become smaller. If you can't take larger samples but can run as many independent tests as you wish, you could combine them and, with enough tests, obtain an overall p-value smaller than any given positive bound.
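
One standard way to combine independent p-values is Fisher's method. A sketch, applied to the ten values reported in the question (it assumes the repetitions are independent, which they are here since the datasets are regenerated each time):

```matlab
% p-values from the ten independent repetitions reported in the question.
p = [8.93e-02; 1.10e-01; 2.41e-05; 8.52e-01; 3.78e-03; ...
     2.22e-01; 2.86e-04; 3.85e-04; 1.36e-02; 9.02e-03];

% Fisher's method: under H0, -2*sum(log(p)) follows a chi-squared
% distribution with 2k degrees of freedom (k = number of tests).
k    = numel(p);
stat = -2 * sum(log(p));
pCombined = 1 - chi2cdf(stat, 2 * k);

fprintf('Fisher combined p-value: %.2e\n', pCombined);
```

This is one standard answer to the second question in the post (Stouffer's method is another common choice), but note that it only sharpens the answer to "do they differ?", which, as below, may not be the question you actually care about.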

From the little bit of information you give in your description, it doesn't sound like a hypothesis test is a good choice for your situation. If you describe your underlying problem in greater detail (what you're trying to achieve before you get to the point of deciding to use hypothesis tests), it may be that alternatives more suited to your needs could be suggested.

Answering the question "do they differ?" is probably pointless: they do, at least a little, and enough data will tell you so. A more useful question is closer to "are they different enough that it matters?"
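
One concrete way to ask that question: look at the KS distance itself (the maximum gap between the two empirical CDFs, which kstest2 returns as its third output) and compare it against a tolerance chosen on subject-matter grounds. A sketch, with an entirely illustrative tolerance of 0.01 and placeholder data:

```matlab
% Judge the size of the discrepancy rather than its statistical detectability.
x = randn(10000, 1);                 % placeholder for outputs of A
y = randn(10000, 1);                 % placeholder for outputs of B

[~, ~, ksDistance] = kstest2(x, y);  % max |ECDF_x(t) - ECDF_y(t)| over t

tolerance = 0.01;                    % illustrative "close enough" threshold
if ksDistance <= tolerance
    fprintf('KS distance %.4f is within tolerance %.2f\n', ksDistance, tolerance);
else
    fprintf('KS distance %.4f exceeds tolerance %.2f\n', ksDistance, tolerance);
end
```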