Solved – Understanding hypothesis testing (Kruskal Wallis, Mann-Whitney) when there are tied values

Tags: hypothesis-testing, kruskal-wallis-test, python, scipy, wilcoxon-mann-whitney-test

tldr: In SciPy's stats module for Python, the kruskal and mannwhitneyu functions both raise an error if called with only identical values. Why?

Context: I'm evaluating results obtained using stochastic algorithms. To do so, I use Kruskal-Wallis tests to determine whether two sets of results are significantly different at a certain confidence level, and if they are, I use Mann-Whitney U tests to determine which group is significantly greater.

I'm using scipy.stats.kruskal and scipy.stats.mannwhitneyu in Python to do this for me. Now, one of the algorithms has a very high chance of returning 0, so I have many groups of results that consist entirely of 0s. When I run kruskal on two groups that consist entirely of 0s, the function raises an error like "All numbers are identical in kruskal". mannwhitneyu does the same.

In my context, I think I can safely assume that two groups consisting only of zeros are not significantly different. I would, in fact, expect the test to tell me that they are from the same population with a 100% confidence level. But before I act on my assumption and write a bypass into my script, I would like to understand why the functions are designed to behave in this way. Why are two groups, both only consisting of a single value, considered invalid input for a Kruskal or MWU test? In what scenarios would it be erroneous to conclude that the two groups are identical?

Best Answer

The Wilcoxon rank-sum (Mann-Whitney U) two-sample test and the Kruskal-Wallis test for comparing $k > 2$ samples are both based on ranks. That is, the numerical values you input are reduced to ranks before the test statistic is computed.

For a single sample, ranking goes like this (output from R):

x = c(1, 4, 7, 3, 11, 6);  x
[1]  1  4  7  3 11  6
rank(x)
[1] 1 3 5 2 6 4

If there are ties in the data, then ranking is not quite so straightforward:

y = c(1, 4, 6, 2, 4, 6, 11);  y
[1]  1  4  6  2  4  6 11
rank(y)
[1] 1.0 3.5 5.5 2.0 3.5 5.5 7.0
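Since the question is about Python, here is a rough SciPy counterpart of the two R calls above. It is only a sketch: it relies on `scipy.stats.rankdata`, whose default tie handling assigns tied values the average of the ranks they would otherwise occupy, as R's `rank` does. The all-zero group is my own addition, mimicking the situation in the question.

```python
from scipy.stats import rankdata

# Same data as the R example with ties.
y = [1, 4, 6, 2, 4, 6, 11]
ranks = rankdata(y)  # default method="average": tied values share the mean rank
print(ranks)

# A group like the one in the question: every value identical.
# All six observations share the single average rank (1 + 2 + ... + 6) / 6 = 3.5,
# so the ranks carry no information at all.
all_zero_ranks = rankdata([0, 0, 0, 0, 0, 0])
print(all_zero_ranks)
```

When every value is tied, every observation gets the same rank, which is the degenerate case the rank-based tests have trouble with.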

Various texts and software packages treat ties differently. In order to find a p-value for the test, all of them begin with distribution theory based on ranks from data without ties. Some have approximate distribution theory for the less-tidy ranks that result from ties. (When more than one sample is involved, not only duplicated values within a sample count as ties; duplicated values anywhere among the groups also count.)
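One common approximate adjustment for ties, found in many textbooks, divides the Kruskal-Wallis statistic by the factor $C = 1 - \sum_j (t_j^3 - t_j) / (N^3 - N)$, where $t_j$ is the size of the $j$-th group of tied values and $N$ is the total number of observations. The sketch below computes $C$ by hand with NumPy; the function name `tie_correction` is my own, and I am assuming (not asserting) that this is the adjustment behind the error message in the question. When all values are identical there is a single tie group with $t = N$, so $C = 0$ and the corrected statistic would be $0/0$.

```python
import numpy as np

def tie_correction(values):
    """Textbook tie-correction factor C = 1 - sum(t^3 - t) / (N^3 - N).

    `values` is the pooled data from all groups (at least two observations).
    """
    values = np.asarray(values)
    n = values.size
    # counts[j] is t_j, the number of observations sharing the j-th distinct value
    _, counts = np.unique(values, return_counts=True)
    return 1.0 - np.sum(counts**3 - counts) / (n**3 - n)

c_some_ties = tie_correction([1, 4, 6, 2, 4, 6, 11])  # two tied pairs: C slightly below 1
c_all_tied = tie_correction([0, 0, 0, 0, 0, 0])       # one tie group of size N: C is 0
print(c_some_ties, c_all_tied)
```

A few ties leave $C$ close to 1, so the adjustment is mild; the more ties there are, the further $C$ moves toward 0 and the less the approximation can be trusted.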

If there are only a few ties in a dataset, then printed warnings about ties can often be ignored. If there are many ties, then either no p-value will be given, or the p-value provided may be essentially useless.

An unofficial trick can be used to assess how serious ties are in any one analysis. You can artificially jitter the data with just enough random noise to break ties and see if the p-value changes by enough to matter. (Best to do this a couple of times with different jittering.)

Example:

x = c(1, 2, 2, 4, 5, 3, 0);  y = c(4, 6, 3, 8, 11, 11)
wilcox.test(x, y)

        Wilcoxon rank sum test with continuity correction

data:  x and y
W = 4, p-value = 0.01778
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(x, y) : cannot compute exact p-value with ties


# Jittering
jx = runif(length(x), -.01, .01);  jy = runif(length(y), -.01, .01)
wilcox.test(x+jx, y+jy)

        Wilcoxon rank sum test

data:  x + jx and y + jy
W = 4, p-value = 0.01399
alternative hypothesis: true location shift is not equal to 0

The original and jittered $x$s look like this:

x; rank(x)
[1] 1 2 2 4 5 3 0
[1] 2.0 3.5 3.5 6.0 7.0 5.0 1.0
round(x+jx, 4); rank(x+jx)
[1]  0.9958  1.9900  2.0035  3.9945  5.0093  2.9995 -0.0095
[1] 2 3 4 6 7 5 1

Another run with different random jittering also gave a p-value of about 0.014. So if we are working at the 5% level, it seems safe to say that the original version of the test (with warnings about ties) gave a useful result.
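The same jittering check can be sketched in Python with `scipy.stats.mannwhitneyu`. The seed and the choice of three repetitions are arbitrary assumptions of mine, and the exact p-values will depend on the random jitter and on how your SciPy version chooses between exact and approximate p-values, so I do not quote any numbers here.

```python
import numpy as np
from scipy.stats import mannwhitneyu

x = np.array([1, 2, 2, 4, 5, 3, 0], dtype=float)
y = np.array([4, 6, 3, 8, 11, 11], dtype=float)

rng = np.random.default_rng(2024)  # arbitrary seed, only for reproducibility

# Test on the original data, ties included.
stat_orig, p_orig = mannwhitneyu(x, y, alternative="two-sided")

# Repeat the test a few times on data jittered by uniform noise in [-0.01, 0.01],
# which is small enough to break ties without reordering distinct values.
jittered_p = []
for _ in range(3):
    jx = rng.uniform(-0.01, 0.01, size=x.size)
    jy = rng.uniform(-0.01, 0.01, size=y.size)
    _, p = mannwhitneyu(x + jx, y + jy, alternative="two-sided")
    jittered_p.append(p)

print("original p-value:", p_orig)
print("jittered p-values:", jittered_p)
```

If the jittered p-values all land on the same side of your significance level as the original one, the ties are probably not distorting the conclusion.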

If there is any doubt, or if the result is for review or publication, then you should use a test that handles ties properly rather than rely on jittering. My first choice would be a simulated permutation test. One elementary review of permutation tests is Eudey et al. (Sect. 3 is most relevant to this discussion.) Also, perhaps see this Q&A. Permutation tests do not need to use ranks and can handle ties without difficulty.
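A minimal sketch of a simulated permutation test in Python follows. The test statistic (absolute difference in sample means), the number of resamples, the seed, and the function name are all my own assumptions for illustration; other statistics may suit your data better.

```python
import numpy as np

def permutation_test_mean_diff(x, y, n_resamples=10_000, seed=0):
    """Two-sided simulated permutation test using |mean(x) - mean(y)|."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())
    count = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled values to two groups of the original sizes.
        perm = rng.permutation(pooled)
        diff = abs(perm[:x.size].mean() - perm[x.size:].mean())
        # Small tolerance so that floating-point noise does not hide exact ties.
        if diff >= observed - 1e-12:
            count += 1
    # Add-one correction: the observed arrangement counts as one permutation.
    return (count + 1) / (n_resamples + 1)

p_example = permutation_test_mean_diff([1, 2, 2, 4, 5, 3, 0], [4, 6, 3, 8, 11, 11])
p_all_zero = permutation_test_mean_diff([0, 0, 0, 0], [0, 0, 0, 0, 0])
print(p_example, p_all_zero)
```

With two all-zero groups every permuted difference equals the observed difference of 0, so the simulated p-value is exactly 1: ties cause no trouble here, and the result matches the intuition that such groups cannot be told apart.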

Finally, I don't know exactly how your groups with all 0's should be treated, because I don't know what process is producing the 0's; clearly rank-based tests are not the right tool for them. You are correct that two groups with all 0's can't be distinguished from one another by hypothesis testing. If one group has a few 0's and another has mostly 0's, a binomial test (0's vs non-0's) or permutation test should help to judge whether they came from different populations.
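If you do decide to write the bypass mentioned in the question, one possible shape is sketched below. The wrapper name is hypothetical, and returning a statistic of 0 with a p-value of 1 for all-identical input is a convention I am assuming here, not something SciPy itself does; whether it is appropriate depends on the process generating your 0's.

```python
import numpy as np
from scipy import stats

def kruskal_or_no_difference(*groups):
    """Run scipy.stats.kruskal, but skip the test when every value is identical.

    Assumed convention: all-identical data gives no evidence of a difference,
    so report H = 0 and p = 1 instead of calling SciPy. Groups must be non-empty.
    """
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    if np.all(pooled == pooled[0]):
        return 0.0, 1.0
    result = stats.kruskal(*groups)
    return float(result.statistic), float(result.pvalue)

h_zero, p_zero = kruskal_or_no_difference([0, 0, 0], [0, 0, 0, 0])
h_xy, p_xy = kruskal_or_no_difference([1, 2, 2, 4, 5, 3, 0], [4, 6, 3, 8, 11, 11])
print(p_zero, p_xy)
```

Checking the pooled data before calling SciPy, rather than catching the exception, keeps the behavior the same across SciPy versions that may handle this case differently.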