Hypothesis Testing – How to Understand the Mann-Whitney $U$ Test: Distribution Equality or Mean/Median Equality?

Tags: hypothesis-testing, mathematical-statistics, nonparametric, statistical-significance, wilcoxon-mann-whitney-test

I am rather confused about the Mann-Whitney test. Many statements I read say it tests for distribution equality between two populations, while others say it tests only for means/medians/central tendency. I ran some simple tests, and they suggest it only tests for central tendency, not shape. Yet many books state distribution equality (of the pdfs). Why? Can you please explain?

Distribution equality statements

  • Sheldon Ross' book
    Suppose that one is considering two different methods of production in determining whether the two methods result in statistically identical items. To attack this problem let $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ denote samples of the measurable values of items by method 1 and method 2. If we let $F$ and $G$, both assumed to be continuous, denote the distribution functions of the two samples, respectively, then the hypothesis we wish to test is $H_0: F = G$. One procedure for testing $H_0$ is the Mann-Whitney test.

  • Some Caltech notes
    Now suppose we have two samples. We want to know whether they could have been drawn from the same population, or from different populations, and, if the latter, whether they differ in some predicted direction. Again assume we know nothing about probability distributions, so that we need non-parametric tests. Mann-Whitney (Wilcoxon) U test: there are two samples, A ($m$ members) and B ($n$ members); $H_0$ is that A and B are from the same distribution or have the same parent population.

  • Wikipedia
    This test can be used to investigate whether two independent samples were selected from populations having the same distribution.

  • Nonparametric Statistical Tests
    The null hypothesis is $H_0: \theta = 0$; that is, there is no difference at all between the distribution functions $F$ and $G$.

But when I test $F = N(0,10)$ against $G = U(-3,3)$, the p-value is very high. The two distributions could hardly be more different; all they share is $E(F) = E(G)$ and symmetry.
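For concreteness, here is a minimal Octave sketch of exactly this comparison (taking $N(0,10)$ to mean standard deviation 10), with the two-sample KS test from the same package included for contrast:

#octave: N(0,10) vs U(-3,3) -- Mann-Whitney vs two-sample KS (sketch)
pkg load statistics
f = normrnd(0, 10, [1,100]);                  #100 draws from N(0,10), sd = 10
g = unifrnd(-3, 3, [1,100]);                  #100 draws from U(-3,3)
[p_mw, z]  = u_test(f, g)                     #Mann-Whitney: p typically large, no rejection
[p_ks, ks] = kolmogorov_smirnov_test_2(f, g)  #two-sample KS: p typically tiny, rejects F=G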

Mean/median equality statements

  • Article
    The Mann–Whitney U-test can be used when the aim is to show a difference between two groups in the value of an ordinal, interval or ratio variable. It is the non-parametric version of the t-test.
  • Test results

#octave
pkg load statistics                 #load the Octave statistics package
x  = normrnd(0, 1, [1,100]);        #100 draws from N(0,1)
y1 = normrnd(0, 3, [1,100]);        #100 draws from N(0,3)
y2 = normrnd(0, 20, [1,100]);       #100 draws from N(0,20)
y3 = unifrnd(-5, 5, [1,100]);       #100 draws from U(-5,5)
[p, ks] = kolmogorov_smirnov_test(y1, "norm", 0, 1);  #KS test of y1 against N(0,1)
#p = 0.000002: y1 ~ N(0,3) is clearly not N(0,1)
[p, z] = u_test(x, y1);             #Mann-Whitney of x~N(0,1) vs y1~N(0,3)
#p = 0.52: fail to reject the null
[p, z] = u_test(x, y2);             #Mann-Whitney of x~N(0,1) vs y2~N(0,20)
#p = 0.32: fail to reject the null
[p, z] = u_test(x, y3);             #Mann-Whitney of x~N(0,1) vs y3~U(-5,5)
#p = 0.15: fail to reject the null
#Apparently, Mann-Whitney doesn't test pdf equality
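For contrast, a sketch (reusing x from above) showing that the test does reject when there is a genuine location shift:

#octave: Mann-Whitney does reject under a location shift (sketch)
y4 = normrnd(1, 1, [1,100]);   #100 draws from N(1,1): same shape as x, shifted by 1
[p, z] = u_test(x, y4);        #Mann-Whitney of x~N(0,1) vs y4~N(1,1)
#p typically well below 0.05: null rejected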

Confusing

  • Nonparametric Statistical Methods, 3rd Edition
    I don't understand how its $H_0: E(Y) - E(X) = 0$ (i.e. no shift) can be deduced from (4.2), which seems to assume pdf equality (equal higher moments) apart from the shift.
  • Article
    The test can detect differences in shape and spread as well as just differences in medians. Differences in population medians are often accompanied by equally important differences in shape. Really? How? I'm confused.

After-thoughts

It seems many notes teach MW in a duck-typing way: MW is introduced as a duck because, if we focus only on a duck's key behaviours (quacking = pdf, swimming = shape), MW does look like a duck (a location-shift test). Most of the time a duck and Donald Duck don't behave too differently, so such a description of MW seems fine and easy to understand; but when Donald Duck stochastically dominates a duck while still quacking like one, MW can show significance, baffling unsuspecting students. That is not the students' fault, but a pedagogical mistake: claiming Donald Duck is a duck without clarifying that he can be un-duck-like at times.

Also, my feeling is that in parametric hypothesis testing, tests are introduced with their purpose framed in $H_0$, leaving $H_1$ implicit. Many authors then move on to nonparametric testing without first highlighting how the test-statistic probabilities are obtained differently (by permuting the pooled $X$ and $Y$ samples under $H_0$), so students continue to differentiate tests by looking at $H_0$ alone.

We are taught to use the t-test for $H_0: \mu_x = k$ or $H_0: \mu_x = \mu_y$ and the F-test for $H_0: \sigma_x^2 = \sigma_y^2$, with $H_1: \mu_x \ne \mu_y$ and $H_1: \sigma_x^2 \ne \sigma_y^2$ left implicit. For nonparametric tests, on the other hand, we need to be explicit about what is tested in $H_1$, since $H_0: F = G$ is shared by all tests of a permutation nature. So instead of seeing $H_0: F = G$, automatically thinking $H_1: F \ne G$, and concluding it must be a K-S test, we should pay attention to the $H_1$ actually under analysis ($F \ne G$, $F > G$, a shift) and pick a test (KS, MW) accordingly.
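To make the permutation point concrete, here is an illustrative Octave sketch (my own, not from any quoted source) that builds the null distribution of $U$ by permuting the pooled samples:

#octave: permutation null distribution of U (illustrative sketch)
pkg load statistics
x = normrnd(0, 1, [1,30]);  y = normrnd(1, 1, [1,30]);
U = @(a, b) sum(sum(a.' < b));          #U = number of pairs (i,j) with a_i < b_j
u_obs = U(x, y);                        #observed U statistic
pooled = [x y];  m = numel(x);  n = numel(y);
nperm = 5000;  u_null = zeros(1, nperm);
for k = 1:nperm
  idx = randperm(m + n);                #relabel the pooled sample at random
  u_null(k) = U(pooled(idx(1:m)), pooled(idx(m+1:end)));
end
p = mean(abs(u_null - m*n/2) >= abs(u_obs - m*n/2))  #two-sided permutation p-value

Under $H_0: F = G$ the labels are exchangeable, so every relabelling is equally likely; the p-value is just the fraction of relabellings whose $U$ deviates from $mn/2$ at least as much as the observed one.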

Best Answer

It is informative to see exactly what the Mann-Whitney test does. For two samples $X = \{x_1, \dots, x_m \}$ and $Y=\{y_1, \dots, y_n\}$, under the assumptions that

  • Observations in $X$ are iid
  • Observations in $Y$ are iid
  • The samples $X$ and $Y$ are mutually independent.
  • The respective populations from which $X$ and $Y$ were sampled are continuous.

the $U$ statistic is defined as:

$$ U = \sum_{i=1}^m \sum_{j=1}^n \mathbf{1}(x_i < y_j), $$

where $\mathbf{1}(\cdot)$ is the indicator function.
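In Octave this double sum can be computed directly with broadcasting; a one-line sketch, assuming x and y are row vectors as in the question's code:

#octave: direct computation of the U statistic (sketch)
U = sum(sum(x.' < y));   #counts pairs (i,j) with x_i < y_j

Under the continuity assumption, ties occur with probability zero, so the strict inequality suffices; implementations typically count any ties that do occur as 1/2.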

It should be reasonably intuitive that if $X$ and $Y$ come from the same distribution (i.e. under the null hypothesis), the expected value of $U$ is $mn/2$, since you could expect values below a certain rank to occur as often for $X$ as for $Y$. So you can think of the Mann-Whitney test as checking to what extent the statistic $U$ deviates from this expected value.
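To spell that expectation out: under $H_0$, $F = G$ and both are continuous, so each pair satisfies $P(x_i < y_j) = 1/2$, and by linearity of expectation

$$ E[U] = \sum_{i=1}^m \sum_{j=1}^n P(x_i < y_j) = \frac{mn}{2}. $$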

If this intuition isn't clear, think of the first rank (i.e. the smallest value in each sample). If $X$ and $Y$ were drawn from the same distribution, you would have no reason to expect the smallest value in $X$ to be less than the smallest value in $Y$ more than 50% of the time; if it were, that would make you think $X$ actually has a heavier left tail than $Y$. You can extend this logic to the 2nd smallest value, the 3rd, and so forth.

Similarly, if you drew the same number of observations, say $K$, from each population, you could almost think of the ranks as $K$ "common bins" with fuzzy boundaries. If $X$ and $Y$ came from the same population, you might expect each rank to occupy roughly the same space, and there is no reason to think that the $x_k$ observation in bin $k$ would be to the right of $y_k$ more than 50% of the time.

However, if $x_k$ at a particular "bin" $k$ lies to the right of $y_k$ more often than not, this indicates a systematic "shift". This is what makes Mann-Whitney a good test for detecting a shift between distributions that are assumed to be relatively similar except for a possible location shift due to, say, a treatment effect.
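To quantify the pure-shift case with a worked example (my addition, assuming normal populations): if $X \sim \mathcal N(0,1)$ and $Y \sim \mathcal N(\delta, 1)$ independently, then $X - Y \sim \mathcal N(-\delta, 2)$, so

$$ P(X < Y) = P(X - Y < 0) = \Phi\!\left(\frac{\delta}{\sqrt{2}}\right), $$

and $E[U] = mn\,\Phi(\delta/\sqrt{2})$ moves away from $mn/2$ as the shift $\delta$ grows; this growing deviation is exactly what the test detects.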

Now consider the $X \sim \mathcal N(0,1)$ vs $Y \sim \mathcal N(0,2)$ scenario, with $K=1000$ samples of each. You would expect that for the most part, given the same rank, negative values in $Y$ tend to lie to the left of their counterparts in $X$ more or less all the time, whereas positive values in $Y$ tend to lie to the right of $X$ more or less all the time. So in this particular scenario, even though the distributions are completely different, half the time $X$ is less likely to be larger than $Y$ and half the time it is more likely; the two effects cancel, so you would expect the $U$ statistic to be very close to its expected value $K^2/2$, and therefore unlikely to be significant.
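A quick simulation of this scenario (a sketch; I read $\mathcal N(0,2)$ as standard deviation 2) shows the cancellation numerically:

#octave: U stays near its null expectation K^2/2 for N(0,1) vs N(0,2)
pkg load statistics
K = 1000;
x = normrnd(0, 1, [1,K]);   #K draws from N(0,1)
y = normrnd(0, 2, [1,K]);   #K draws from N(0,2), sd twice as large
U = sum(sum(x.' < y));      #observed U statistic
U / K^2                     #fraction of pairs with x_i < y_j; comes out near 0.5
[p, z] = u_test(x, y)       #p is typically large despite the very different spreads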

In other words, it may be a reasonable test for comparing two samples in a general "goodness of fit" sense in some specific circumstances, but it is important to be familiar with the situations where it is not. The example above is one such case.