Hypothesis Testing – Causes of Exploding Statistics and P-Values Near Zero with Wilcoxon-Mann-Whitney Test

hypothesis-testing · statistical-significance · wilcoxon-mann-whitney-test

I am using the Wilcoxon rank-sum/Mann-Whitney U test to compare metrics between two groups. In almost all cases I get huge values for the statistic and p-values that are essentially zero. In one case I got a reasonable p-value of about 0.22, which raises the question of whether the near-zero p-values in the other cases are genuine. The number of values in the distributions I compare ranges between 200k and 800k depending on the metric, and can differ between the two groups.

The null hypothesis is that both groups are from the same distribution.

  • Is it possible to get p-values around zero with these tests and what does that mean?
  • Is there a cause which could explain this behavior?

Please comment if you need more information on my setup.

Edit 1:

The statistic values seem reasonable: they are close to the expected mean of $\frac{mn}{2}$, i.e. half the product of the two group sizes.

Edit 2:

It seems that I have a few very high outliers, but they should not be an issue for the chosen rank-based test, as described here: Do we need to worry about outliers when using rank-based tests?

Best Answer

Nothing out of the ordinary is going on from the sound of it.

In almost all cases, I get huge values for the statistic

Have you looked at the range of possible values for the statistic?

For the usual form of the U-statistic, it can take values between $0$ and $mn$ where $m$ and $n$ are the two sample sizes.

If you divide the statistic by $mn$, you get $\frac{U}{mn}$, which is the proportion (rather than the count) of cases in which a value from one sample exceeds a value from the other, which takes values between $0$ and $1$. The null case corresponds to an expected proportion of $\frac12$ (with standard error $\sqrt{\frac{m+n+1}{12mn}}$).

Alternatively, you could look at a z-score, which you may find somewhat more intuitive than the raw test statistic.
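To see what this looks like in practice, here is a minimal Python/SciPy sketch with made-up data and group sizes (the shift of 0.02 and the sample sizes are illustrative assumptions, not your data). It computes $U$, the proportion $U/mn$, and a hand-rolled approximate z-score; note that SciPy's own p-value uses a tie- and continuity-corrected normal approximation, so the z-score below is only approximate.

```python
# Sketch: rescale the Mann-Whitney U statistic to a proportion and a z-score.
# Data, group sizes, and the shift are hypothetical, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.00, scale=1.0, size=300_000)  # hypothetical metric values
group_b = rng.normal(loc=0.02, scale=1.0, size=500_000)

m, n = len(group_a), len(group_b)
res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

u = res.statistic                 # raw U: between 0 and m*n, so huge for large samples
prop = u / (m * n)                # proportion of (a, b) pairs with a > b; 0.5 under H0
se = np.sqrt((m + n + 1) / (12 * m * n))  # standard error of that proportion under H0
z = (prop - 0.5) / se             # approximate z-score

print(f"U = {u:.0f}, U/mn = {prop:.4f}, z = {z:.2f}, p = {res.pvalue:.3g}")
```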

Is it possible to get p-values around zero with these tests and what does that mean?

Certainly. Unless your sample sizes are very small, extremely small p-values are possible.

For a one-tailed test the p-value may be as small as $\frac{m!\, n!}{(m+n)!}$ and twice that for a two-tailed test. For example with small sample sizes of $m=n=10$, you could see a two-tailed p-value as small as $1/92378$ or about $0.000011$ and the smallest available p-values decrease very rapidly as sample sizes increase. Doubling both sample sizes to $m=n=20$ reduces the smallest possible p-value by a factor of about $746000$, to $1.45\times 10^{-11}$.
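As a quick sanity check on those numbers, the smallest attainable p-value can be computed directly from a binomial coefficient, since $\frac{m!\,n!}{(m+n)!} = 1/\binom{m+n}{m}$. A short Python snippet:

```python
# Smallest attainable two-tailed p-value, 2 * m! n! / (m+n)! = 2 / C(m+n, m)
from math import comb

for m, n in [(10, 10), (20, 20)]:
    smallest_one_tailed = 1 / comb(m + n, m)
    print(m, n, 2 * smallest_one_tailed)

# 10 10 -> ~1.08e-05  (i.e. 1/92378)
# 20 20 -> ~1.45e-11
```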

Is there a cause which could explain this behavior?

Either a small-to-moderate effect size with large samples or a large effect size with smaller samples can do it. With hundreds of thousands of values per group, as here, even a very small difference between the two distributions will produce an essentially zero p-value.
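For illustration, here is a small simulation sketch under an assumed normal model with an arbitrary shift of 0.02 standard deviations (both assumptions are mine, not from the question). The same tiny effect gives an unremarkable p-value with a couple of hundred observations per group, but an essentially zero p-value with sample sizes in the range described in the question:

```python
# Same small shift, very different p-values depending on sample size (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
shift = 0.02  # assumed small location shift between the two groups

for size in (200, 400_000):
    a = rng.normal(0.0, 1.0, size)
    b = rng.normal(shift, 1.0, size)
    res = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(f"n per group = {size:>7}: p = {res.pvalue:.3g}")
```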