Nonparametric – Permutation Test vs Mann-Whitney U Test Comparison

biostatisticsnonparametricpermutation-testwilcoxon-mann-whitney-test

I am wondering if there exist any advantage of using the Mann-Whitney U test over a permutation test in the following setting:

Suppose I have a group of animals with eggs in their brood chamber, from this group I assign some to a control vessel and some to a treatment vessel. Then I wait till the newborns hatch in each vessel and I measure the size of the newborns.

I want to test the hypothesis that the newborns came from the same population, i.e, there is no effect of the treatment with respect to control in the size of newborns. Under this hypothesis, I can construct a permutation test to assess if the mean difference between the size of the animals in each vessel is just random or there is evidence that suggest that the difference could not have happened by chance.

Is there any reason to use Mann–Whitney U test over just this simple permutation test?

Best Answer

There are important reasons to use the Wilcoxon-Mann-Whitney two-sample rank-sum test in this context. The most important reason is that the test is transformation-invariant, i.e., you get the same inference whether you log-transform the weights or not. Secondly, this rank test is more robust to outliers. Third, it extends to more complex situations such as needing covariate adjustment (the proportional odds model is the generalization of the Wilcoxon test).

Related Solutions

Solved – Mann-Whitney U test with very large sample size

This is not a problem of the t-test, but of any test in which the power of the test depends on the sample size. This is called "overpowering". And yes, changing the test to Mann-Whitney will not help.

Therefore, apart from asking whether the results are statistically significant, you need to ask yourself whether the observed effect size is significant in the common sense of the word (i.e., meaningful). This requires more than statistical knowledge, but also your expertise in the field you are investigating.

In general, there are two ways you can look at the effect size. One way is to scale the difference between the means in your data by its standard deviation. Since standard deviation is in the same units as your means and describes the dispersion of your data, you can express the difference between your groups in terms of standard deviation. Also, when you estimate the variance / standard deviation in your data, it does not necessarily decrease with the number of samples (unlike standard deviation of the mean).

This is, for example, the reasoning behind Cohen's $d$:

$$d = \frac{ \bar{x}_1 - \bar{x}_2 }{ s}$$

...where $s$ is the square root of the pooled variance.

$$s = \sqrt{\frac{ s_1^2\cdot(n_1-1) + s_2^2\cdot(n_2 - 1) }{ N - 2 } }$$

(where $N=n_1+n_2$ and $s_1$ and $s_2$ are the standard deviations in group 1 and 2, respectively; that is, $s_1 = \sqrt{ \frac{\sum(x_i-\bar{x_1})^2 }{n_1 -1 }} $).

Another way of looking at the effect size -- and frankly, one that I personally prefer -- is to ask what part (percentage) of the variability in the data can be explained by the estimated effect. You can estimate the variance between and within the groups and see how they relate (this is actually what ANOVA is, and t-test is in principle a special case of ANOVA).This is the reasoning behind the coefficient of determination, $r^2$, and the related $\eta^2$ and $\omega^2$ stats. Now, in a t-test, $\eta^2$ can easily be calculated from the $t$ statistic itself:

$$\eta^2 = \frac{ t^2}{t^2 + n_1 + n_2 - 2 }$$

This value can be directly interpreted as "fraction of variance in the data which is explained by the difference between the groups". There are different rules of thumb to say what is a "large" and what is a "small" effect, but it all depends on your particular question. 1% of the variance explained can be laughable, or can be just enough.

Wilcoxon Mann Whitney Test – Criteria for Data Shape in Mann Whitney U Test

If two samples have roughly the same shape, then the Mann-Whitney-Wilcoxon test (a rank sum test) can be considered a test whether the locations (often expressed as medians) differ. Consider fictitious data sampled in R.

set.seed(104)
y1 = rgamma(100, 3, 1/3)
y2 = rgamma(100, 3, 1/3) + 4
median(y1);  median(y2)
[1] 7.684493
[1] 11.85169

stripchart(list(y1,y2), ylim=c(.5,2.5), pch="|")

Because the P-value of the Wilcoxon rank sum test is near $0.$ we can say that sample medians $7.68$ and $11.95$ are significantly different at the 1% level.

wilcox.test(y1, y2)

        Wilcoxon rank sum test with continuity correction

data:  y1 and y2
W = 2634, p-value = 7.477e-09
alternative hypothesis: 
 true location shift is not equal to 0

However, if two samples have distinctly different shapes, rejection of the null hypothesis of the Wilcoxon rank sum test should be interpreted to mean that the population distribution of one sample 'stochastically dominates' the population distribution of the other.

set.seed(2022)
x1 = rgamma(100, 3, 1/3)
x2 = rnorm(100, 12, 3)

summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.868   5.643   7.962   8.411  11.274  19.239 
summary(x2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.654   9.625  11.456  11.756  13.877  19.377 


stripchart(list(x1,x2), ylim=c(.5,2.5), pch="|")

wilcox.test(x1,x2)

        Wilcoxon rank sum test with continuity correction

data:  x1 and x2
W = 2531, p-value = 1.624e-09
alternative hypothesis: 
 true location shift is not equal to 0

Stochastic domination means that values in the second population tend to be larger than values in the first. Perhaps this is best illustrated by showing the empirical CDFs (ECDFs) of the two samples. The dominating ECDF (blue in the figure below) plots to the right of the other ECDF, and thus below.

hdr  = "ECDFs of x1 and x2 (blue)"
plot(ecdf(x1), xlim=c(0, 20), main=hdr)
lines(ecdf(x2), col="blue")

Best Answer

Related Solutions

Solved – Mann-Whitney U test with very large sample size

Wilcoxon Mann Whitney Test – Criteria for Data Shape in Mann Whitney U Test

Related Question