In scipy.stats, the Mann-Whitney U test compares two populations:
> Computes the Mann-Whitney rank test on samples x and y.
but the Wilcoxon test compares two PAIRED populations:
> The Wilcoxon signed-rank test tests the null hypothesis that two
> related paired samples come from the same distribution. In particular,
> it tests whether the distribution of the differences x - y is
> symmetric about zero. It is a non-parametric version of the paired
> T-test.
Note that the paired t-test does not test whether the distribution of the differences is symmetric about zero, so the Wilcoxon signed-rank test is not strictly a non-parametric counterpart of the paired t-test.
The Mann-Whitney test, on the other hand, assumes that all the observations are independent of each other (no basis for pairing here!). It also assumes that the two distributions are the same, and the alternative is that one is stochastically greater than the other. If we make the additional assumption that the only difference between the two distributions is their location, and the distributions are continuous, then "stochastically greater than" is equivalent to such statements as "the medians are different", so you can, with the extra assumption(s), interpret it that way.
The Mann-Whitney uses a continuity correction by default, but the Wilcoxon doesn't.
The Mann-Whitney test handles ties using midranks, while the Wilcoxon test offers three options for handling ties in the paired values (i.e., a zero difference between the two elements of a pair).
It sounds like the Wilcoxon test is the more appropriate choice for your purposes, since you do have that lack of independence between observations. However, one might imagine that requests with similar, but not equal, lengths exhibit similar behavior, whereas the Wilcoxon test treats observations that aren't paired as independent. A logistic regression model might serve you better in that case.
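To make the contrast concrete, here is a minimal sketch running both tests on the same synthetic paired data with `scipy.stats` (assuming scipy >= 1.7; keyword defaults have varied across versions — this is an illustration, not the asker's actual data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=30)
y = x + rng.normal(loc=0.5, scale=0.5, size=30)  # paired with a shift

# Mann-Whitney: treats x and y as independent samples; pairing is ignored.
u_stat, u_p = stats.mannwhitneyu(x, y, alternative="two-sided")

# Wilcoxon: uses the pairing; zero_method selects how zero differences
# (ties within a pair) are handled.
w_stat, w_p = stats.wilcoxon(x, y, zero_method="wilcox")

print("Mann-Whitney:", u_stat, u_p)
print("Wilcoxon:    ", w_stat, w_p)
```

Because the simulated shift is built into the pairing, the Wilcoxon test typically detects it with a much smaller p-value than the Mann-Whitney test on the same data.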
Quotes are from the scipy.stats documentation pages.
This is not a problem of the t-test specifically, but of any test whose power depends on the sample size: with a large enough sample, even a tiny, practically irrelevant difference becomes statistically significant. This is sometimes called an "overpowered" test. And yes, switching to the Mann-Whitney test will not help.
Therefore, apart from asking whether the results are statistically significant, you need to ask whether the observed effect size is significant in the everyday sense of the word (i.e., meaningful). Answering that requires not only statistical knowledge but also expertise in the field you are investigating.
In general, there are two ways you can look at effect size. One way is to scale the difference between the means by the standard deviation of the data. Since the standard deviation is in the same units as the means and describes the dispersion of the data, you can express the difference between your groups in units of standard deviation. Importantly, the estimated standard deviation of the data does not systematically decrease with the number of samples (unlike the standard error of the mean).
This is, for example, the reasoning behind Cohen's $d$:
$$d = \frac{ \bar{x}_1 - \bar{x}_2 }{ s}$$
...where $s$ is the square root of the pooled variance.
$$s = \sqrt{\frac{ s_1^2\cdot(n_1-1) + s_2^2\cdot(n_2 - 1) }{ N - 2 } }$$
(where $N=n_1+n_2$ and $s_1$ and $s_2$ are the standard deviations in groups 1 and 2, respectively; that is, $s_1 = \sqrt{ \frac{\sum(x_i-\bar{x}_1)^2 }{n_1 -1 }} $).
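The formulas above translate directly into code. A minimal sketch (the `cohens_d` helper is my own naming, not a scipy function; `ddof=1` gives the $n-1$ denominator used in the definition of $s_1$):

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d with the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = np.std(x1, ddof=1), np.std(x2, ddof=1)
    # Pooled standard deviation, as in the formula above.
    s = np.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s

x1 = np.array([4.0, 5.0, 6.0, 7.0])
x2 = np.array([1.0, 2.0, 3.0, 4.0])
print(cohens_d(x1, x2))  # difference of means is 3, pooled s ~ 1.29, so d ~ 2.32
```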
Another way of looking at the effect size -- and frankly, one that I personally prefer -- is to ask what part (percentage) of the variability in the data can be explained by the estimated effect. You can estimate the variance between and within the groups and see how they relate (this is actually what ANOVA does, and the t-test is in principle a special case of ANOVA). This is the reasoning behind the coefficient of determination, $r^2$, and the related $\eta^2$ and $\omega^2$ statistics. Now, in a t-test, $\eta^2$ can easily be calculated from the $t$ statistic itself:
$$\eta^2 = \frac{ t^2}{t^2 + n_1 + n_2 - 2 }$$
This value can be directly interpreted as "fraction of variance in the data which is explained by the difference between the groups". There are different rules of thumb to say what is a "large" and what is a "small" effect, but it all depends on your particular question. 1% of the variance explained can be laughable, or can be just enough.
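As a sketch, the formula can be applied directly to the output of scipy's two-sample t-test (synthetic data; the `eta_squared` helper is my own naming):

```python
import numpy as np
from scipy import stats

def eta_squared(t, n1, n2):
    # eta^2 = t^2 / (t^2 + n1 + n2 - 2), per the formula above.
    return t**2 / (t**2 + n1 + n2 - 2)

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.8, 1.0, 50)

t, p = stats.ttest_ind(a, b)
print(eta_squared(t, len(a), len(b)))  # fraction of variance explained
```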
Best Answer
Nothing out of the ordinary is going on from the sound of it.
Have you looked at the range of possible values for the statistic?
For the usual form of the U-statistic, it can take values between $0$ and $mn$ where $m$ and $n$ are the two sample sizes.
If you divide the statistic by $mn$, you get $\frac{U}{mn}$, which is the proportion (rather than the count) of cases in which a value from one sample exceeds a value from the other, which takes values between $0$ and $1$. The null case corresponds to an expected proportion of $\frac12$ (with standard error $\sqrt{\frac{m+n+1}{12mn}}$).
Alternatively, you could look at a z-score, which you may find somewhat more intuitive than the raw test statistic.
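A minimal sketch of both rescalings, assuming scipy >= 1.7 (where `mannwhitneyu` reports the U statistic for the first sample) and ignoring any tie correction in the normal approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 25)   # m = 25
y = rng.normal(0.7, 1.0, 30)   # n = 30

u, _ = stats.mannwhitneyu(x, y, alternative="two-sided")
m, n = len(x), len(y)

# Proportion of (x, y) pairs where x exceeds y: in [0, 1], 1/2 under the null.
prop = u / (m * n)

# z-score: U has null mean m*n/2 and SD sqrt(m*n*(m+n+1)/12).
z = (u - m * n / 2) / np.sqrt(m * n * (m + n + 1) / 12)

print(prop, z)
```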
Certainly. Unless your sample sizes are very small, extremely small p-values are possible.
For a one-tailed test the p-value may be as small as $\frac{m!\, n!}{(m+n)!}$ and twice that for a two-tailed test. For example with small sample sizes of $m=n=10$, you could see a two-tailed p-value as small as $1/92378$ or about $0.000011$ and the smallest available p-values decrease very rapidly as sample sizes increase. Doubling both sample sizes to $m=n=20$ reduces the smallest possible p-value by a factor of about $746000$, to $1.45\times 10^{-11}$.
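These minimum p-values are easy to check, since $\frac{m!\,n!}{(m+n)!} = 1/\binom{m+n}{m}$:

```python
from math import comb

def min_two_tailed_p(m, n):
    # Smallest attainable two-tailed p-value: 2 * m! n! / (m+n)!
    return 2 / comb(m + n, m)

p10 = min_two_tailed_p(10, 10)
p20 = min_two_tailed_p(20, 20)
print(p10)        # 1/92378, about 0.000011
print(p20)        # about 1.45e-11
print(p10 / p20)  # reduction factor, about 746000
```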
A small to moderate effect size with large samples or a large effect size with smaller samples can both do it.