Solved – Unequal Sample Size and Wilcoxon Rank Sum Test

wilcoxon-mann-whitney-test

It is my understanding that it is okay to perform the Wilcoxon Rank Sum test with slightly unequal sample sizes. I just don't understand how the Wilcoxon accounts for the missing numbers in one group given that it subtracts a random score of one group to a random score of another group.

Also, what would you say the H1 of the Wilcoxon rank sum test is? I've been seeing some different definitions in my readings.

Best Answer

The Wilcoxon Rank sum test doesn't "subtract a random score of one group to a random score of another group".

If it did what you say in the question, that could work perfectly well even with very different sample sizes (you could sample from either group with replacement, so equal sample sizes would be unnecessary), but that's not how it works.

As the name suggests, the rank-sum test sums the ranks in one of the samples. It may then apply a shift (say by subtracting the minimum possible sum of ranks).
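As a minimal sketch of that computation with unequal sample sizes (the data values below are made up purely for illustration), the rank sum and its shifted version can be computed by hand and compared with what `wilcox.test` reports:

```r
# Hypothetical data with unequal group sizes (n1 = 5, n2 = 8); values are made up.
x <- c(1.8, 2.4, 3.1, 4.7, 5.2)
y <- c(2.9, 3.6, 4.1, 5.5, 6.3, 6.8, 7.4, 8.0)
n1 <- length(x)

# Rank all observations together, then sum the ranks belonging to the first sample.
r <- rank(c(x, y))
rank_sum <- sum(r[seq_len(n1)])

# Shift by the minimum possible rank sum, n1 * (n1 + 1) / 2.
W <- rank_sum - n1 * (n1 + 1) / 2

# The shifted value should agree with the statistic reported by wilcox.test().
W_builtin <- unname(wilcox.test(x, y)$statistic)
c(rank_sum = rank_sum, W = W, W_builtin = W_builtin)
```

Nothing here pairs observations across groups, which is why the two sample sizes do not need to match.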

[Where did you get the idea? It sounds like someone tried to explain permutation tests to you but they've ended up with a muddle of paired and independent sample and rank vs original-value notions all smooshed together.]

There's not one single alternative for the Wilcoxon Rank Sum test; it depends on what additional assumptions you make and how you look at it. The most general alternative form is that $P(X>Y)\neq \frac12$ (two tailed; the one tailed versions replace $\neq$ with either $<$ or $>$).
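To make the quantity $P(X>Y)$ concrete, here is a small sketch on made-up data (the values are hypothetical and contain no ties). It estimates $P(X>Y)$ as the proportion of all cross-group pairs in which the $x$ value is larger; with no ties this should equal the shifted rank-sum statistic divided by $n_1 n_2$:

```r
# Hypothetical data, unequal sample sizes, no ties.
x <- c(1.8, 2.4, 3.1, 4.7, 5.2)
y <- c(2.9, 3.6, 4.1, 5.5, 6.3, 6.8, 7.4, 8.0)

# Proportion of all (x_i, y_j) pairs in which x_i exceeds y_j.
p_hat <- mean(outer(x, y, ">"))

# With no ties, the same value is the statistic W divided by n1 * n2.
W <- unname(wilcox.test(x, y)$statistic)
p_from_W <- W / (length(x) * length(y))
c(p_hat = p_hat, p_from_W = p_from_W)
```

An estimate far from $\frac12$ in either direction is what the two-tailed test treats as evidence against the null.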

Related Solutions

Solved – How many « degrees of freedom » should a Wilcoxon rank-sum test have

For this design-based Wilcoxon (rank-sum) test, the degrees of freedom are the design degrees of freedom: the number of primary sampling units in the design minus the number of strata. These are denominator degrees of freedom for a $t$ or $F$ distribution. In this case, with two groups, it's a $t$ distribution. (With more than two groups there would also be a numerator degrees of freedom for an $F$ distribution, which would be one less than the number of groups; this numerator df is analogous to the df for the $\chi^2$ approximation to the Kruskal-Wallis test)

The use of design degrees of freedom in a central-limit-theorem approximation is very common in survey analysis. Simulations in the paper proposing these design-based rank tests (Lumley, T., & Scott, A. J. (2013). Two-sample rank tests under complex sampling. Biometrika, 100(4), 831-842.) show that using the t distribution and approximate degrees of freedom gives better performance.

In this particular case the primary sampling units are the individual observations, and there is only one stratum, so the df is 8-1=7.

I should note that whether this test is what you want depends on what your weights are. svyranktest is for sampling weights: it compares estimated population ranks, rather than comparing sample ranks while giving some observations more weight. The test is not exact in small samples.

The data and weights given look unusual for a probability sample -- at the very least, the weights must have been scaled, since sampling weights must be 1 or greater.

T-Test – Choosing Between T-Test and Wilcoxon Rank Sum Test for Statistical Analysis

You should not expect an accurate result from a two-sample t test on samples that are sufficiently far from normal to fail Shapiro-Wilk tests of normality. A P-value of about 4% would be just barely significant even if accurate.

If the two samples have approximately the same shape, a Wilcoxon rank sum test might tell you whether population medians are significantly different. However, this test is not quite as powerful as a t test. In any case, a P-value of about 8% is not impressive evidence for a significant difference between population locations.

@Dave has a good point that you have done too many tests on the data. Cherry picking the smallest P-value of two 2-sample tests would be "P-hacking."

Consider the following fictitious data:

set.seed(1234)
x1 = rexp(30, 1/10);  x2 = rexp(30, 1/15)
mean(x1); mean(x2)
[1] 9.384906
[1] 17.75834

Means are quite different. The issue is whether the difference is statistically significant at, say, the 5% level. Boxplots show strongly right-skewed samples and apparently different dispersions.

x = c(x1,x2);  g = rep(1:2, each=30)
boxplot(x~g, horizontal=TRUE, col="skyblue2")

Normal probability plots are clearly not linear, so the data should not be assumed normal. The Welch t test may or may not give useful results with sample sizes as large as $n_1=n_2 = 30.$

R code for figure:

par(mfrow=c(1,2))
 qqnorm(x1); qqline(x1, col="blue")
 qqnorm(x2); qqline(x2, col="blue")
par(mfrow=c(1,1))

My first (and only) test would be a permutation test using the Welch t statistic as metric. This test does not assume that data are normal, nor that the t statistic has a t distribution. It approximates the distribution of the t statistic for our data. [We look at P-values here because the Welch t test tends to have slightly different degrees of freedom at each iteration.]

pv.obs = t.test(x~g)$p.val; pv.obs
[1] 0.02797518
pv = replicate(10^5, t.test(x~sample(g))$p.val)
mean(pv <= pv.obs)
[1] 0.02633  # Sim. P-value of permutation test

So the permutation test finds a significant difference at the 3% level.

Because the boxplots show different shapes (dispersions), I would stop there.

If you want to know what the pooled t test and the Wilcoxon rank sum test would have given, here are the results. But we have done a valid test already, so these results are to satisfy curiosity, not as valid test results.

t.test(x1,x2, var.equal=TRUE)$p.val # Pooled
[1] 0.02622975
wilcox.test(x1,x2)$p.val            # Wilcoxon rank sum
[1] 0.2358858

Note: My fictitious right-skewed data for this Answer were sampled from exponential populations. If you know that data are exponential, then there is an exact test. See this Q&A, where it is stated that, when the two population means are equal, the means of two independent exponential samples, each of size $n,$ have $\frac{\bar X_1}{\bar X_2} \sim \mathsf{F}(2n,2n).$ So, for our data with $\bar X_2 > \bar X_1,$ the P-value of an exact 2-sided test is $0.015.$

f = mean(x2)/mean(x1); f
[1] 1.892223
2*(1 - pf(f, 60, 60))
[1] 0.01470998
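As a rough check of the $\mathsf{F}(2n,2n)$ claim, the following sketch simulates the ratio of sample means when both exponential populations have the same rate, so that the null hypothesis holds. The seed, the number of replications, and the common rate are arbitrary choices made for this illustration. If the claim is right, about 95% of simulated ratios should fall between the 2.5% and 97.5% quantiles of $\mathsf{F}(2n,2n)$:

```r
set.seed(2024)           # arbitrary seed for this sketch
n <- 30; rate <- 1/10    # same rate in both groups, so the null hypothesis holds

# Ratio of two independent exponential sample means, repeated many times.
ratio <- replicate(20000, mean(rexp(n, rate)) / mean(rexp(n, rate)))

# Fraction of simulated ratios lying between the central 95% quantiles of F(2n, 2n).
lo <- qf(0.025, 2*n, 2*n); hi <- qf(0.975, 2*n, 2*n)
cover <- mean(ratio > lo & ratio < hi)
cover
```

Because the ratio of means does not depend on the common rate, the choice of `rate` should not matter here.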
