Solved – Mann-Whitney-Wilcoxon Test on highly unbalanced groups.

Tags: statistical-power, unbalanced-classes, wilcoxon-mann-whitney-test

I have two groups of independent samples whose means I want to compare. The data come from a distinctly non-normal distribution (unimodal, but heavily right-skewed).

Let's say that one group (Group A) has 14,000 observations and the other group (Group B) has 700 observations. Thus each group has a decent number of observations (we are not talking about 2000 versus 5 here).

I am of the opinion that this strong imbalance in the group sizes will not make the Mann-Whitney-Wilcoxon test any less applicable; however, a colleague is suggesting that we randomly sample 700 observations from Group A so that our group sizes are perfectly balanced. He feels that the imbalance of the groups is making the test fail to find the "correct" result (where "correct result" means "the result he wants", i.e. that the two groups are statistically significantly different).

As I said, I think that having highly unequal groups does not cause the Mann-Whitney-Wilcoxon Test to somehow stop working.

As far as I understand it, the size of the smaller group will have implications for the power of our test, but even there, I am thinking that 700 observations in Group B is probably at least a decently sufficient sample size.

Will this highly unbalanced data somehow invalidate my analysis?

Best Answer

In no way will the difference in sample sizes adversely affect the Mann-Whitney-Wilcoxon test. It's explicitly suitable for groups of different sizes, and the degree of imbalance doesn't impact the essential properties of the test.

There's little more to say without some clearer indication of what your colleague thinks the problem is (aside from a burning desire to get a different outcome... which, when taken to the point of action, is called p-hacking -- even if they were hoping not to reject).

Even if the ratio of sample sizes had been much more extreme -- say $n_1=100000$ and $n_2=2$ -- there would be no issue with the validity of the test, and no justification I can discern for reducing the larger sample size.
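To make this concrete, here's a small sketch (not from the original answer, and assuming `scipy` is available): running `scipy.stats.mannwhitneyu` on groups of sizes 14,000 and 700 drawn from the same right-skewed distribution. The test runs without complaint and returns an ordinary, valid p-value; nothing about the imbalance breaks it.

```python
# Illustrative sketch: the Mann-Whitney-Wilcoxon test on highly
# unbalanced groups. The lognormal choice is just a stand-in for
# "heavily right-skewed" data; it is not from the original answer.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Same right-skewed distribution in both groups (so H0 is true here),
# with the question's sample sizes of 14,000 and 700.
a = rng.lognormal(mean=0.0, sigma=1.0, size=14_000)
b = rng.lognormal(mean=0.0, sigma=1.0, size=700)

stat, p = mannwhitneyu(a, b, alternative="two-sided")
print(f"U = {stat:.0f}, p = {p:.3f}")  # no error, ordinary p-value
```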

[On the other hand if you're in a position to choose sample sizes beforehand, and you can make the smaller sample size nearer in size to the larger one it may be worth trading a lot of values from the larger sample to get some more in the smaller sample (increasing the smaller sample size from 700 to 770 would be worthwhile even if it meant you could only afford 5000 in the larger sample rather than 14000). That's not what we're discussing here though.]

If your colleague is hoping that a reduction in the larger sample size (by randomly choosing a smaller subsample) will make the test more likely to reject, it won't. The power will decrease somewhat (equivalently, the smallest effect size you're able to detect at a given level of power will increase). E.g. if there was 50% power for a given effect size with 14000 and 700, reducing the sample sizes to 700 and 700 would in many situations reduce the power to under 30%.
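A quick Monte Carlo makes the power loss visible. This is a hedged sketch under assumptions of my own choosing (lognormal data, a small shift of 0.1 on the log scale, 200 simulations, and the helper name `power`) rather than anything from the original answer; the qualitative conclusion -- throwing away 13,300 observations costs power -- doesn't depend on those choices.

```python
# Sketch: estimated rejection rate of the two-sided Mann-Whitney test
# at (14000, 700) versus (700, 700) under a small true shift.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

def power(n1, n2, shift=0.1, nsim=200, alpha=0.05):
    """Fraction of simulated datasets where the test rejects at `alpha`."""
    rejections = 0
    for _ in range(nsim):
        a = rng.lognormal(0.0, 1.0, n1)
        b = rng.lognormal(shift, 1.0, n2)  # group B shifted on the log scale
        _, p = mannwhitneyu(a, b, alternative="two-sided")
        rejections += p < alpha
    return rejections / nsim

print("power at n = (14000, 700):", power(14_000, 700))
print("power at n = ( 700, 700):", power(700, 700))
```

With settings like these, the balanced-but-smaller design rejects noticeably less often, not more.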

As a result, if you failed to reject with the large sample, then unless your colleague plans to fudge the sampling, there's very little chance of obtaining a rejection with the reduced one (but fudging or not, there's no justification for doing this on the present information).


However -- while it sounds like this isn't the problem here -- it's often the case that when people are bothered by a large sample size, what bothers them is a rejection at a large sample size. When that happens, it's usually because they have confused statistical significance with practical meaningfulness. [If they expect a procedure to identify only practically meaningful differences, they really have no business using an ordinary null-hypothesis significance test in the first place. That's not what it does and that's not what it's for. With very large sample sizes it will identify very small differences.]

Another common problem with the Mann-Whitney-Wilcoxon is that people often pick up a misunderstanding (usually straight out of one popular text or another) of what this test actually tests for -- for example, expressing amazement that the test rejects "equality of medians" when the sample medians are identical.
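That situation is easy to construct, because the test compares $P(X>Y)$ to $1/2$ rather than comparing medians as such. In the sketch below (illustrative data of my own construction), both groups have a sample median of exactly 0, yet one group's values tend to sit above the other's, and the test rejects decisively.

```python
# Sketch: identical sample medians, yet the Mann-Whitney test rejects,
# because the test is about P(X > Y), not about medians per se.
import numpy as np
from scipy.stats import mannwhitneyu

# Both groups have median 0, but A's halves sit mostly above B's:
# A's upper half reaches 100 while B's reaches only 1, and A's lower
# half only reaches -1 while B's reaches -100.
a = np.concatenate([np.linspace(-1, 0, 500), np.linspace(0, 100, 500)])
b = np.concatenate([np.linspace(-100, 0, 500), np.linspace(0, 1, 500)])

print("median A:", np.median(a), "median B:", np.median(b))  # both 0.0
_, p = mannwhitneyu(a, b, alternative="two-sided")
print(f"p = {p:.2g}")  # rejects despite identical sample medians
```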

Without more details it's hard to suggest anything further; with them, there might have been room to offer some additional thoughts.