Solved – Reproducibility of the two-sample Kolmogorov–Smirnov test

hypothesis testing, kolmogorov-smirnov test

I am using the two-sample Kolmogorov–Smirnov test to check whether two datasets have the same underlying distribution. I can regenerate those two datasets as many times as I want. When I apply the test, however, the computed p-values differ quite dramatically from each other. I am wondering what the reason could be and how one might address it.

To give an example, here are the p-values that the test yields (I am using MATLAB’s kstest2) when I apply it to 10 pairs of datasets with 10000 samples each (a significance level of 0.05 is assumed):

8.93e-02
1.10e-01
2.41e-05 (reject)
8.52e-01
3.78e-03 (reject)
2.22e-01
2.86e-04 (reject)
3.85e-04 (reject)
1.36e-02 (reject)
9.02e-03 (reject)

As you can see, the results are quite inconsistent: almost a 50/50 split between rejections and non-rejections. I do not know how to interpret these results and would be grateful for any help. In particular, I am interested to know

  1. if there is something that I can change in my experimental setup to make the test more consistent/reproducible and

  2. if there is a sensible way to summarize multiple p-values into a single one, for example by averaging them.

What is also interesting is that drawing only 1000 samples makes many tests pass:

2.35e-01
9.88e-01
5.29e-01
3.94e-01
1.60e-01
1.76e-01
2.63e-03 (reject)
2.82e-01
1.94e-01
2.14e-01

The dependence on the sample size worries me: it seems that I can make my test pass simply by reducing the number of samples, which sounds like cheating.

Thank you!

EDIT (more context):

As requested, I would like to provide additional details about what I am actually doing. I have two deterministic algorithms, A and B, whose inputs are random with known distributions. A is the ground truth, and B is an approximation to A. B is supposed to produce outputs with the same distribution as A's. I would like to measure how well B approximates A, and my original idea was to apply the Kolmogorov–Smirnov test and monitor the corresponding p-values.
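
In case it helps, here is roughly what the experiment looks like. run_A and run_B below are placeholders made up for this post (my actual algorithms are more involved and not shown); everything else is plain kstest2 usage.

```matlab
% Placeholder algorithms: stand-ins for my deterministic algorithms applied
% to random inputs. The small 0.02 shift in run_B is purely illustrative.
run_A = @(n) randn(n, 1);            % "ground truth" outputs
run_B = @(n) randn(n, 1) + 0.02;     % "approximation" to A

nSamples = 10000;                    % samples per dataset
nRepeats = 10;                       % independent repetitions
pValues  = zeros(nRepeats, 1);

for k = 1:nRepeats
    x = run_A(nSamples);             % regenerate dataset from A
    y = run_B(nSamples);             % regenerate dataset from B
    [~, pValues(k)] = kstest2(x, y); % two-sample KS test (alpha = 0.05 by default)
end

disp(pValues)
```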

Regards,
Ivan

Best Answer

p-values are random quantities; they're a function of your (random) samples, so naturally they vary from sample to sample.

[Indeed, for a point null hypothesis and a continuous (rather than discrete) test statistic, if $H_0$ were true, the p-values generated in this manner would be uniform on $(0,1)$.]

As the situation moves further and further from the null (in the sense measured by the test statistic), the distribution of p-values shifts toward the low end. The p-value is still random, with a long tail toward larger values, but it becomes stochastically smaller.
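
Both of those claims are easy to check by simulation. Here is a small sketch (not your setup; standard normal samples are used, and the 0.05 mean shift in the second case is an arbitrary choice just to put things a little way from the null):

```matlab
rng(1);                     % reproducibility of this sketch only
nSim = 1000;                % simulated repetitions of the test
n    = 10000;               % samples per dataset, as in the question

pNull = zeros(nSim, 1);     % H0 true: both samples from N(0,1)
pAlt  = zeros(nSim, 1);     % H0 false: second sample shifted by 0.05

for k = 1:nSim
    [~, pNull(k)] = kstest2(randn(n, 1), randn(n, 1));
    [~, pAlt(k)]  = kstest2(randn(n, 1), randn(n, 1) + 0.05);
end

% Under H0 the p-values are roughly uniform on (0,1); under this alternative
% they are shifted toward zero but still vary a great deal from run to run.
fprintf('Null:        median p = %.3f, rejection rate = %.3f\n', ...
        median(pNull), mean(pNull < 0.05));
fprintf('Alternative: median p = %.3f, rejection rate = %.3f\n', ...
        median(pAlt),  mean(pAlt < 0.05));
```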

So you're simply expecting something that won't happen. The p-values will never be particularly consistent in value; they don't tend to concentrate around some underlying "population" value.

What you're seeing is how hypothesis tests work.

Your results, considered all together, already indicate the null is false (point nulls are rarely exactly true, so this is not much of a surprise). If you can take larger samples, your typical p-values will become smaller. If you can't take larger samples but can run as many independent tests as you wish, you could combine them and, with enough tests, obtain an overall p-value smaller than any given positive bound.
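
One standard way to combine independent p-values is Fisher's method. A sketch, applied to the ten values reported in the question (it assumes the repetitions are independent, which they are here since the datasets are regenerated each time):

```matlab
% p-values from the ten independent repetitions reported in the question.
p = [8.93e-02; 1.10e-01; 2.41e-05; 8.52e-01; 3.78e-03; ...
     2.22e-01; 2.86e-04; 3.85e-04; 1.36e-02; 9.02e-03];

% Fisher's method: under H0, -2*sum(log(p)) follows a chi-squared
% distribution with 2k degrees of freedom (k = number of tests).
k    = numel(p);
stat = -2 * sum(log(p));
pCombined = 1 - chi2cdf(stat, 2 * k);

fprintf('Fisher combined p-value: %.2e\n', pCombined);
```

This is one standard answer to the second question in the post (Stouffer's method is another common choice), but note that it only sharpens the answer to "do they differ?", which, as below, may not be the question you actually care about.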

From the little bit of information you give in your description, it doesn't sound like a hypothesis test is a good choice for your situation. If you describe your underlying problem in greater detail (what you're trying to achieve before you get to the point of deciding to use hypothesis tests), it may be that alternatives more suited to your needs could be suggested.

Answering the question "do they differ?" is probably pointless: they do, at least a little, and enough data will tell you so. A more useful question is closer to "are they different enough that it matters?"
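
One concrete way to ask that question: look at the KS distance itself (the maximum gap between the two empirical CDFs, which kstest2 returns as its third output) and compare it against a tolerance chosen on subject-matter grounds. A sketch, with an entirely illustrative tolerance of 0.01 and placeholder data:

```matlab
% Judge the size of the discrepancy rather than its statistical detectability.
x = randn(10000, 1);                 % placeholder for outputs of A
y = randn(10000, 1);                 % placeholder for outputs of B

[~, ~, ksDistance] = kstest2(x, y);  % max |ECDF_x(t) - ECDF_y(t)| over t

tolerance = 0.01;                    % illustrative "close enough" threshold
if ksDistance <= tolerance
    fprintf('KS distance %.4f is within tolerance %.2f\n', ksDistance, tolerance);
else
    fprintf('KS distance %.4f exceeds tolerance %.2f\n', ksDistance, tolerance);
end
```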