Solved – Kolmogorov-Smirnov vs Mann-Whitney U When There Are Ties


I have a dataset consisting of rank data, some 100 cases and 2 groups. (The 2 groups contain about 1/3 and 2/3 of the cases.) I would like to test whether the two groups differ with respect to median rank. I used a Mann-Whitney U test. A colleague suggested that when there are many ties, a Kolmogorov-Smirnov test is more accurate. Is that so? To what extent? In any case, the M-W test shows statistical significance and the K-S does not, so my questions are two: (1) Do ties affect the alpha error rate in the M-W test or just the beta error rate? (2) How do the K-S test (which tests more than just the median) and the M-W U test compare when attempting to detect differences in median (with respect to power and alpha error)? In short, which test do I trust?

Best Answer

I'm not sure what the basis is for your colleague's claim -- but they should support the claims they make before you accept them as true -- there's an astonishing amount of misinformed folklore about. (How do they know that this is true? Do you have good reason to think it must be true in your case?)

Both tests assume$^\dagger$ continuous distributions and both are impacted by ties (however, it's relatively easy to deal with ties in the Mann-Whitney and some software will do so automatically).

$\dagger$ Edit: To support my claim of the assumption of continuity in respect of the Mann-Whitney (since whuber says I am wrong on this point, I had better justify it), I refer to the beginning of Mann and Whitney (1947):

1. Summary. Let $x$ and $y$ be two random variables with continuous cumulative distribution functions $f$ and $g$.

So for Mann and Whitney's version of the test, they do explicitly assume continuity - and not idly, since they do rely on it in their derivation. However, it's possible (as I mention later) to deal with ties in the Mann-Whitney by working out the distribution of the test statistic at the null under the pattern of ties, or by correctly computing the effect of ties on the variance of the statistic under the normal approximation (what's usually referred to as the 'adjustment for ties').
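As a rough sketch of the 'adjustment for ties' mentioned above, the tie-corrected null variance can be computed by hand and checked against `wilcox.test`. The data here are hypothetical (a made-up 5-point scale, not the asker's data), and the continuity correction is switched off only to keep the comparison simple.

```r
# Hypothetical example data with many ties (not the asker's data)
set.seed(1)
x <- sample(1:5, 33, replace = TRUE)
y <- sample(1:5, 67, replace = TRUE)

n1 <- length(x); n2 <- length(y); N <- n1 + n2

# Midranks of the pooled sample; W is R's version of the U statistic
r <- rank(c(x, y))
W <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Sizes of the tied groups in the pooled sample
t_k <- table(c(x, y))

# Null variance of W without and with the adjustment for ties
var_no_ties <- n1 * n2 * (N + 1) / 12
var_ties    <- n1 * n2 / 12 * ((N + 1) - sum(t_k^3 - t_k) / (N * (N - 1)))

# Two-sided normal-approximation p-value (no continuity correction)
z <- (W - n1 * n2 / 2) / sqrt(var_ties)
p_manual <- 2 * pnorm(-abs(z))

# wilcox.test() applies the same adjustment when the normal approximation is used
p_r <- wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value
c(var_no_ties = var_no_ties, var_ties = var_ties, p_manual = p_manual, p_r = p_r)
```

Ties always shrink the variance, so using the no-ties variance makes the standardized statistic too small and the test conservative.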

For both tests, if the effect of the ties is not properly dealt with, both kinds of error rate are impacted - the type I error rate is lowered, and lowering the actual significance level necessarily lowers power ($=1-\beta$).

It's not 100% clear to me which test might be the most impacted, nor under what circumstances, but offhand I'd have expected the greater sensitivity generally went with the KS test* - and this is even before one 'adjusts' the Mann-Whitney for ties (i.e. if you used the normal approximation and used the variance for the no ties case).

*(personally, I'd use simulation suited to the specific instance to see what the properties would be under the sorts of conditions you see, at those sample sizes.)

Below is an illustration of the impact on the distribution of p-values under identical population distributions with a moderate level of ties$^\ddagger$, with sample sizes of 33 and 67, under the default settings in R (which for the Mann-Whitney uses the normal approximation with correct calculation of variance in the presence of ties for this sample size):

[Figure: histograms of simulated p-values under the null for the Mann-Whitney and Kolmogorov-Smirnov tests]

For the tests to work 'as advertized' under the null, these distributions should look close to uniform. As you see, the Mann-Whitney (at least when properly calculating the variance of the sum of the ranks under the presence of ties, as here) is indeed very close to uniform. Since (as we can see) for the Kolmogorov-Smirnov test the proportion of p-values below $\alpha$ will be much smaller than $\alpha$, the test is highly conservative, with corresponding effects on power. [If anything, the effect is somewhat stronger than I'd have anticipated.]

$\ddagger\,$(the impact on the variance of the test statistic is fairly small in percentage terms)
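A simulation of this kind might look like the sketch below. It is not the code behind the original plot: the tie-generating mechanism (a hypothetical 7-point scale) and the number of replications are assumptions. `exact = FALSE` is passed to `ks.test` so that the asymptotic p-value, which ignores the ties, is used whatever the R version's default handling of ties happens to be.

```r
set.seed(123)
n_sim <- 2000

# Hypothetical tie-generating mechanism: both groups drawn from the same
# discrete uniform distribution on a 7-point scale (so the null is true)
draw <- function(n) sample(1:7, n, replace = TRUE)

p_vals <- replicate(n_sim, {
  x <- draw(33)
  y <- draw(67)
  c(mw = wilcox.test(x, y, exact = FALSE)$p.value,
    # asymptotic KS p-value, computed as if the data were continuous
    ks = suppressWarnings(ks.test(x, y, exact = FALSE)$p.value))
})

# Proportion of p-values at or below several nominal levels;
# for a well-behaved test each entry should be close to its alpha
alphas <- c(0.01, 0.05, 0.10)
null_rates <- sapply(alphas, function(a) rowMeans(p_vals <= a))
colnames(null_rates) <- paste0("alpha=", alphas)
null_rates
```

Calling `hist()` on either row of `p_vals` gives a histogram to compare against the uniform shape.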

Further, if your interest lies in a location-shift alternative, the Mann-Whitney would have greater power against that alternative to start with, so even if it did lose more power as a result of the discreteness (which I doubt), it may still have more power afterward.

You don't say how heavily tied your data are, nor in what sort of pattern. If both tests are more impacted than you're prepared to accept, you can work with the permutation distribution of either test statistic for your data (or with the permutation distribution of some other statistic, including a difference in sample medians if you wish).
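A minimal sketch of such a permutation (strictly, randomization) test is given below. The function names `perm_test` and `ks_stat` and the example data are hypothetical; the default statistic is the difference in sample medians, and the KS statistic is computed by hand from the two empirical CDFs.

```r
# Randomization test: reshuffle group labels and recompute the statistic
perm_test <- function(x, y, stat = function(a, b) median(a) - median(b),
                      n_perm = 9999) {
  pooled <- c(x, y)
  n1 <- length(x)
  observed <- stat(x, y)
  perm_stats <- replicate(n_perm, {
    idx <- sample(length(pooled), n1)      # random relabelling of the cases
    stat(pooled[idx], pooled[-idx])
  })
  # Two-sided p-value, counting the observed arrangement as one permutation
  (1 + sum(abs(perm_stats) >= abs(observed))) / (n_perm + 1)
}

# Two-sample KS statistic: largest gap between the empirical CDFs
ks_stat <- function(a, b) {
  z <- sort(unique(c(a, b)))
  max(abs(ecdf(a)(z) - ecdf(b)(z)))
}

set.seed(42)
x <- sample(1:7, 33, replace = TRUE)       # hypothetical tied data
y <- sample(1:7, 67, replace = TRUE)

p_median <- perm_test(x, y)                              # difference in medians
p_ks     <- perm_test(x, y, stat = ks_stat, n_perm = 1999)  # KS statistic
c(p_median = p_median, p_ks = p_ks)
```

Because the reference distribution is built from the observed values themselves, the pattern of ties is accounted for automatically.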

In spite of many books (especially in some particular areas of application) stating that it is, the Mann-Whitney is not actually a test for a difference in medians. However, if you additionally assume that the population distributions are the same under the null, and restrict the alternative to a location-shift, then it's a test for difference in any reasonable location measure - population medians, population lower quartiles, even population means (if they exist).

Indeed, one needn't restrict oneself to location shift alternatives. Assuming identical distributions under the null against an alternative that will move medians (or any other measure of location) will work; so for example, it would work perfectly well that way as a test of medians under an assumption of scale-shift. We must keep in mind however, that the Mann-Whitney is a far more general test than that and that when we rely on an assumption to make it a test for medians or whatever, we do actually lean on our assumption for the conclusion to make it mean what we want it to.

In short, which test do I trust?

Don't simply trust what anyone says (including me!) - unless they have solid evidence (I haven't brought any that's directly relevant to your situation, and none relating to power, because I haven't seen your pattern of ties and I am not 100% sure whether you're only interested in location shifts).

What kind of data do you have (what are you measuring, how are you measuring it, and how do ties arise)? What are you interested in finding out? Why do you mention medians?

Use simulation to find out how any tests you contemplate behave in circumstances similar to yours, and decide for yourself whether there's a problem to worry about. For both tests, see what the impact of ties is on the test, both under the null and under alternatives you care about. Then, in the case of the Mann-Whitney, see the effect of the adjustment for ties, and compare it with using the exact permutation distribution (or, in large samples like yours, the randomization distribution). For the KS you can look at the exact permutation distribution as well.
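For the 'alternatives you care about' part, a power simulation could be sketched along these lines. Everything specific here is an assumption made for illustration: normal scores rounded to the nearest 0.5 to create ties, pure location shifts of 0, 0.3 and 0.6 standard deviations, and the helper name `rejection_rate`. Substitute a mechanism that resembles how your own ties arise.

```r
# Estimated rejection rate of both tests for a given location shift
rejection_rate <- function(shift, n1 = 33, n2 = 67, n_sim = 1000, alpha = 0.05) {
  p <- replicate(n_sim, {
    # Hypothetical data: normal scores rounded to the nearest 0.5 (creates ties)
    x <- round(2 * rnorm(n1)) / 2
    y <- round(2 * (rnorm(n2) + shift)) / 2
    c(mw = wilcox.test(x, y, exact = FALSE)$p.value,
      # asymptotic KS p-value, which does not account for the ties
      ks = suppressWarnings(ks.test(x, y, exact = FALSE)$p.value))
  })
  rowMeans(p <= alpha)
}

set.seed(2024)
# Column "null" estimates the type I error rate; the others estimate power
power_table <- sapply(c(null = 0, small = 0.3, moderate = 0.6), rejection_rate)
power_table
```

The same function can be rerun with a different rounding rule or different sample sizes to see how sensitive the comparison is to the pattern of ties.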

Related Solutions

Mann-Whitney U Test – Conducting Mann-Whitney U Test and K-S Test with Unequal Sample Sizes

With such large sample sizes both tests will have high power to detect minor differences. The 2 distributions could be almost identical, with a small difference in shape or location that is not of practical importance, and the tests would still reject (because the distributions are different).

If all you really care about is a statistically significant difference then you can be happy with the results of the KS test (and others, even a t-test will be meaningful with non-normal data of those sample sizes due to the Central Limit Theorem).

If you care about practical or meaningful differences then things become subjective, but you can compare using various plots to help you decide if you think there are differences that are enough to care about.

Another possibility is doing a visual test as documented in

 Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne,
 D. F. and Wickham, H. (2009) Statistical inference for exploratory
 data analysis and model diagnostics. Phil. Trans. R. Soc. A,
 367, 4361-4383. doi: 10.1098/rsta.2009.0120

The vis.test function in the TeachingDemos package for R helps implement the test, but it can be done by hand as well.

Basically you create a bunch of graphs and then see if you can tell which is which. For your question one possibility would be to create a histogram of the 122,000 observations from the one month, then take several samples of 122,000 from the 300,000 observations of the other month and create histograms of each of those samples. Then present someone (or several someones) with all the histograms in random order and see if they can pick out the one that represents the first month. If they consistently pick out the correct graph then that says there is something visually different and you can further explore how they differ. If they don't pick out the correct graph then that suggests that while there may be a statistically significant difference, it is not important enough to distinguish them visually.
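Done by hand, the lineup could be sketched roughly as follows. This does not use `vis.test`; the function name `lineup_hist`, the number of decoys, and the scaled-down simulated data are all assumptions made for illustration.

```r
# Draw the target histogram among decoy histograms in random order and
# return (invisibly) the panel number that holds the real target
lineup_hist <- function(target, pool, n_decoys = 8, breaks = 30) {
  n <- length(target)
  # Decoys: random subsamples of the larger data set, same size as the target
  panels <- c(list(target),
              replicate(n_decoys, sample(pool, n), simplify = FALSE))
  ord <- sample(length(panels))            # random panel order
  rng <- range(target, pool)
  brk <- seq(rng[1], rng[2], length.out = breaks + 1)  # common bins
  side <- ceiling(sqrt(length(panels)))
  op <- par(mfrow = c(side, side), mar = c(2, 2, 2, 1))
  on.exit(par(op))
  for (i in seq_along(ord)) {
    hist(panels[[ord[i]]], breaks = brk, main = paste("Panel", i),
         xlab = "", ylab = "")
  }
  invisible(which(ord == 1))               # position of the target panel
}

# Hypothetical stand-ins for the two months, scaled down by a factor of 100
set.seed(99)
target <- rnorm(1220, mean = 0.05)
pool   <- rnorm(3000)
```

Calling `answer <- lineup_hist(target, pool)` draws the nine panels; show them to a viewer, ask which one looks different, and only then compare their pick with `answer`.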

Solved – Mann-Whitney null hypothesis under unequal variance

The Mann-Whitney test is a special case of a permutation test (the distribution under the null is derived by looking at all the possible permutations of the data) and permutation tests have the null as identical distributions, so that is technically correct.

One way of thinking of the Mann-Whitney test statistic is as a measure of the number of times a randomly chosen value from one group exceeds a randomly chosen value from the other group. So the P(X>Y)=0.5 also makes sense, and this is technically a property of the equal distributions null (assuming continuous distributions where the probability of a tie is 0). If the 2 distributions are the same then the probability of X being greater than Y is 0.5 since they are both drawn from the same distribution.
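This interpretation is easy to check numerically. In the sketch below (hypothetical simulated samples), the statistic reported by `wilcox.test` is compared with a direct count of the pairs in which the x value exceeds the y value, and dividing by the number of pairs gives an estimate of P(X>Y).

```r
set.seed(7)
x <- rnorm(25, mean = 0.5)   # hypothetical continuous samples, so no ties
y <- rnorm(25)

# Statistic reported by wilcox.test()
W <- unname(wilcox.test(x, y)$statistic)

# Direct count of (x, y) pairs with x > y; tied pairs would count one half
pairs_x_gt_y <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

# Dividing by the number of pairs estimates P(X > Y)
p_hat <- W / (length(x) * length(y))
c(W = W, by_hand = pairs_x_gt_y, p_hat = p_hat)
```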

The stated case of 2 distributions having the same mean but widely different variances matches with the 2nd null hypothesis, but not the 1st of identical distributions. We can do some simulation to see what happens with the p-values in this case (in theory they should be uniformly distributed):

> out <- replicate( 100000, wilcox.test( rnorm(25, 0, 2), rnorm(25,0,10) )$p.value )
> hist(out)
> mean(out < 0.05)
[1] 0.07991
> prop.test( sum(out<0.05), length(out), p=0.05 )

        1-sample proportions test with continuity correction

data:  sum(out < 0.05) out of length(out), null probability 0.05
X-squared = 1882.756, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.05
95 percent confidence interval:
 0.07824054 0.08161183
sample estimates:
      p 
0.07991

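So clearly the test rejects more often than the nominal 5%, even though P(X>Y)=0.5 holds in this simulation. That behaviour is consistent with the null hypothesis being equality of distributions (which is false here), but not with a null of P(X>Y)=0.5 alone.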

Thinking in terms of probability of X > Y also runs into some interesting problems if you ever compare populations that are based on Efron's Dice.
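To see the problem concretely, here is a small sketch using the standard set of four Efron's dice (the face values are not given in the text above; they are the commonly quoted set). The helper `p_beats` is a hypothetical name. Each die beats the next one in the cycle with probability 2/3, so "tends to be larger than" is not transitive.

```r
# The commonly quoted set of Efron's dice
dice <- list(A = c(4, 4, 4, 4, 0, 0),
             B = c(3, 3, 3, 3, 3, 3),
             C = c(6, 6, 2, 2, 2, 2),
             D = c(5, 5, 5, 1, 1, 1))

# Exact P(a > b) over all 36 equally likely face pairs
p_beats <- function(a, b) mean(outer(a, b, ">"))

cycle <- c(A_beats_B = p_beats(dice$A, dice$B),
           B_beats_C = p_beats(dice$B, dice$C),
           C_beats_D = p_beats(dice$C, dice$D),
           D_beats_A = p_beats(dice$D, dice$A))
cycle   # every entry is 2/3
```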
