Solved – Different implementations of Kolmogorov-Smirnov test and ties

Tags: kolmogorov-smirnov test, r, scipy

I am trying to understand why I get slightly different results in the K-S test, when I use different implementations of it.

I use samples a of length 45 and b of length 1000, which you can find here. Both a and b contain duplicate values, so there are ties. How does this affect the result of my test? Why does only R complain about this with a warning, while neither scipy implementation says anything?
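(A quick check for the ties, assuming a and b are plain Python lists of floats:)

>>> len(a) > len(set(a)), len(b) > len(set(b))  # any duplicate values in a or b?
(True, True)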

scipy.stats.ks_2samp

>>> import scipy.stats
>>> scipy.stats.ks_2samp(a, b)
(0.18788888888888888, 0.084132587804941872)

scipy.stats.mstats.ks_twosamp

>>> import scipy.stats.mstats
>>> scipy.stats.mstats.ks_twosamp(a, b, alternative='two_sided')
(0.18788888888888858, 0.095622608864701905)

ks.test from R

>>> import rpy2.robjects as robjects
>>> ksr = robjects.r['ks.test']
>>> res = ksr(robjects.FloatVector(a), robjects.FloatVector(b))
Warning message:
In function (x, y, ..., alternative = c("two.sided", "less", "greater"),  :
  p-values will be approximate in the presence of ties
>>> print([res[0][0], res[1][0]])
[0.1878888888888889, 0.09562260886470086]

Why do scipy.stats.ks_2samp and scipy.stats.mstats.ks_twosamp return different values? And why do scipy.stats.mstats.ks_twosamp and R's ks.test return the same value?

Does this mean I should use scipy.stats.mstats.ks_twosamp if I want to stick to scipy? I'm not too familiar with masked arrays, but I am not using any, so the output shouldn't vary at all, I think.
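(As a sanity check, and assuming the mstats routines convert their inputs with something like np.ma.asarray, a plain list comes back with nothing masked:)

>>> import numpy as np
>>> np.ma.asarray([1.0, 2.0, 2.0]).mask
False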

Best Answer

The first result differs from the other two, probably because of a different method of dealing with ties (the original Kolmogorov-Smirnov test doesn't deal with them at all). It's not possible to tell without documentation, code, or something else that says exactly what each implementation is doing. You presumably at least have the code and help for all of them.
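For example, you can pull up the help and the source of both scipy versions straight from the interpreter, and print the function body on the R side, to see exactly how each one handles ties and computes its p-value. A rough sketch:

import inspect
import scipy.stats
import scipy.stats.mstats

# the docstrings describe the intended behaviour...
help(scipy.stats.ks_2samp)
help(scipy.stats.mstats.ks_twosamp)

# ...and the source shows exactly how the statistic and the p-value are computed
print(inspect.getsource(scipy.stats.ks_2samp))
print(inspect.getsource(scipy.stats.mstats.ks_twosamp))

# on the R side, printing stats::ks.test (or getAnywhere("ks.test")) at the R prompt
# shows the R implementation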

Both a and b contain duplicate values, so there are ties. How does this affect the result of my test?

The distribution of the test statistic is based on the assumption that the distributions are continuous (so ties are impossible). The distribution is impacted when there are ties, but in such a way that it depends on the particular pattern of ties. Exact answers aren't generally practical and approximations are required.

(If there aren't a large proportion of ties, it won't make a great deal of difference if they're ignored. If there are, it will heavily affect the significance level.)
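As a toy illustration (made-up discrete data, an arbitrary seed and jitter size, nothing to do with your samples): breaking the ties with a negligible jitter can only leave the two-sample statistic unchanged or push it up, so the reported p-value can only stay the same or shrink.

import numpy as np
import scipy.stats

rng = np.random.RandomState(0)

# two heavily tied samples from the SAME discrete distribution (only 5 distinct values)
x = rng.randint(0, 5, size=45).astype(float)
y = rng.randint(0, 5, size=1000).astype(float)

print(scipy.stats.ks_2samp(x, y))    # D and p computed as if the data were continuous

# break the ties with a negligible jitter: D can only stay the same or grow,
# so the p-value can only stay the same or shrink
x_j = x + rng.uniform(0.0, 1e-9, size=x.size)
y_j = y + rng.uniform(0.0, 1e-9, size=y.size)
print(scipy.stats.ks_2samp(x_j, y_j))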

Why does only R complain about this with a warning, while neither scipy implementation says anything?

We can only speculate about the design decisions of others; you'd have to ask the people who wrote each implementation why they decided to do things one way rather than another.

And why do scipy.stats.mstats.ks_twosamp and R's ks.test return the same value?

Because they are computing the same thing, no doubt.
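Whatever each implementation does internally, you can check that those two p-values coincide with the plain large-sample (asymptotic) Kolmogorov formula evaluated at the common D. A quick check, plugging in the statistic and sample sizes from the output above:

import numpy as np
from scipy.special import kolmogorov  # survival function of the Kolmogorov distribution

d, n, m = 0.1878888888888889, 45.0, 1000.0  # statistic and sample sizes from above
lam = np.sqrt(n * m / (n + m)) * d          # asymptotic scaling of the two-sample statistic
print(kolmogorov(lam))                      # roughly 0.0956, matching ks_twosamp and ks.test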

Does this mean I should use scipy.stats.mstats.ks_twosamp if I want to stick to scipy?

I don't see how the information given would support that conclusion. If you can find out what is being done in each case, then you have a basis to decide which fits your situation better; alternatively, you can look at how each behaves in situations similar to yours (that is, use simulation to work out how the type I error rate compares with the nominal rate).
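A rough sketch of that kind of simulation, with data only loosely modelled on yours (normal values rounded to one decimal place to create ties, sample sizes 45 and 1000; the seed, rounding and number of replications are all arbitrary choices):

import numpy as np
import scipy.stats
import scipy.stats.mstats

rng = np.random.RandomState(12345)
n_sim, alpha = 2000, 0.05
rej_ks_2samp = rej_ks_twosamp = 0

for _ in range(n_sim):
    # both samples from the same distribution (the null is true), rounded to create ties
    a_sim = np.round(rng.normal(size=45), 1)
    b_sim = np.round(rng.normal(size=1000), 1)
    if scipy.stats.ks_2samp(a_sim, b_sim)[1] < alpha:
        rej_ks_2samp += 1
    if scipy.stats.mstats.ks_twosamp(a_sim, b_sim)[1] < alpha:
        rej_ks_twosamp += 1

# compare the observed rejection rates under the null to the nominal 5% level
print("ks_2samp   :", rej_ks_2samp / float(n_sim))
print("ks_twosamp :", rej_ks_twosamp / float(n_sim))

Whichever implementation keeps its rejection rate closer to the nominal level on data like yours has the better claim to being appropriate for your situation.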