Kruskal-Wallis test is not significant but some of the Mann-Whitney comparisons are significant

dunn-test, hypothesis testing, kruskal-wallis test, statistical significance, wilcoxon-mann-whitney-test

I have a non-normally distributed dependent variable and an independent variable broken down into 4 groups. As such, I used the Kruskal–Wallis test to look for significant differences in the ranks of the groups.
The data look like the following:

\begin{array}{clc}\rm Group&\rm Size&\rm Mean\ rank\\\hline
0&n=24&31.79\\
1&n=13&26.65\\
2&n=8 &15.94\\
3&n=10&30.30\\
\rm Total& N=55\end{array}

I get an asymptotic significance of .103.
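As a sanity check, the reported significance can be reproduced from the table alone. The sketch below treats the last column of the table as Kruskal–Wallis mean ranks (their size-weighted average is 28, which equals (N+1)/2 for N = 55) and assumes no tie correction, since the raw data are not shown. The function names are illustrative, and the chi-square tail formula used here is the closed form that holds only for 3 degrees of freedom (4 groups).

```python
import math

# Group sizes and mean ranks as given in the table above.
sizes = [24, 13, 8, 10]
mean_ranks = [31.79, 26.65, 15.94, 30.30]


def kruskal_wallis_h(sizes, mean_ranks):
    """H statistic from group sizes and mean ranks (assumes no tie correction)."""
    n_total = sum(sizes)
    grand_mean_rank = (n_total + 1) / 2
    weighted_ss = sum(
        n * (r - grand_mean_rank) ** 2 for n, r in zip(sizes, mean_ranks)
    )
    return 12.0 / (n_total * (n_total + 1)) * weighted_ss


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


h_stat = kruskal_wallis_h(sizes, mean_ranks)
p_value = chi2_sf_3df(h_stat)
print(f"H = {h_stat:.3f}, asymptotic p = {p_value:.3f}")
```

Under these assumptions the p-value lands close to the reported .103, which suggests the table values are indeed ranks over all 55 observations.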

However, if I run a Mann–Whitney U test on the same data, I see a clearly significant difference between groups 0 and 2:

Group   N    Mean Rank   Sum of Ranks
0       24   18.81       451.50
2       8    9.56        76.50
Total   32

Mann-Whitney U  40.500
Wilcoxon W  76.500
Z   -2.416

Asymp. Sig. (2-tailed)  .016
Exact Sig. [2*(1-tailed Sig.)]  .013
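To see where these numbers come from, the following sketch recomputes U and the normal-approximation Z from the rank sums above. It assumes no tie or continuity correction (the raw data are not available), so the result can differ slightly from the software output. Note that the ranks here run from 1 to 32, because only the two groups being compared are ranked.

```python
import math

# Values taken from the Mann-Whitney output above (groups 0 and 2 only).
n0, n2 = 24, 8
rank_sum_2 = 76.5  # sum of ranks for group 2, reported as Wilcoxon W

# U for group 2: its rank sum minus the smallest rank sum it could have.
u_stat = rank_sum_2 - n2 * (n2 + 1) / 2

# Normal approximation (assumption: no tie or continuity correction).
mean_u = n0 * n2 / 2
sd_u = math.sqrt(n0 * n2 * (n0 + n2 + 1) / 12)
z = (u_stat - mean_u) / sd_u

# Two-sided p-value from the standard normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"U = {u_stat}, Z = {z:.3f}, two-sided p = {p_two_sided:.3f}")
```

This gives U = 40.5 and a Z very close to the reported -2.416; the small remaining gap is what a tie correction would be expected to account for.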

The only things I could come up with were that:

  • I may be running into some kind of confounding error with the Mann–Whitney U test, so the significant result is spurious, or
  • The large difference between the sample sizes is causing me to see significance with the Mann–Whitney test but not with the K–W test, since the K–W has a bit more "power" to tease out what is significant and what isn't.

Any help or guidance would be appreciated.

Best Answer

Perhaps not surprising that you got disparate results since (ahem): the Mann-Whitney rank sum test is not an appropriate post hoc pairwise test for the Kruskal-Wallis omnibus test.

The rank sum test:

  1. Ignores the rankings used to conduct the Kruskal-Wallis test

  2. Does not account for the pooled variance implied by the null hypothesis of the Kruskal-Wallis test.

A popular and appropriate post hoc pairwise test is Dunn's test, which does employ the same rankings as the Kruskal-Wallis test, and does account for the pooled variance implied by the null hypothesis of the Kruskal-Wallis test. Another, less well-known post hoc test is the Conover-Iman test, which is strictly more powerful than Dunn's test (assuming the Kruskal-Wallis test rejects its null hypothesis).
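A minimal sketch of the Dunn's test calculation, using the group sizes and Kruskal-Wallis mean ranks from the question, may make the difference concrete. It assumes no ties (the raw data are not shown, so no tie correction is possible) and uses a Bonferroni adjustment, which is only one of several adjustments in use. The function name dunn_pairwise is hypothetical; this is not the code of the dunntest or dunn.test packages.

```python
import math
from itertools import combinations

# Group sizes and Kruskal-Wallis mean ranks (ranked over all N = 55) from the question.
sizes = {0: 24, 1: 13, 2: 8, 3: 10}
mean_ranks = {0: 31.79, 1: 26.65, 2: 15.94, 3: 30.30}


def dunn_pairwise(sizes, mean_ranks):
    """Dunn's z statistics for all pairs (assumes no ties), with Bonferroni p-values."""
    n_total = sum(sizes.values())
    # Pooled variance term implied by the Kruskal-Wallis null hypothesis.
    pooled_var = n_total * (n_total + 1) / 12
    pairs = list(combinations(sorted(sizes), 2))
    results = {}
    for a, b in pairs:
        se = math.sqrt(pooled_var * (1 / sizes[a] + 1 / sizes[b]))
        z = (mean_ranks[a] - mean_ranks[b]) / se
        p_raw = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        p_bonf = min(1.0, p_raw * len(pairs))     # Bonferroni over all pairs
        results[(a, b)] = (z, p_raw, p_bonf)
    return results


results = dunn_pairwise(sizes, mean_ranks)
for (a, b), (z, p_raw, p_bonf) in results.items():
    print(f"{a} vs {b}: z = {z:6.3f}, p = {p_raw:.4f}, Bonferroni p = {p_bonf:.4f}")
```

Under these assumptions the group 0 versus group 2 comparison gives z of roughly 2.42, with an unadjusted p near 0.015 and a Bonferroni-adjusted p near 0.09 across the six comparisons. Both the mean ranks and the variance term come from the full sample of 55, not from re-ranking the 32 observations in the two groups.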

Dunn's test is implemented for Stata in the dunntest package (within Stata type net describe dunntest, from(http://www.alexisdinno.com/stata)), and for R in the dunn.test package. The Conover-Iman test is implemented within Stata in the conovertest package (within Stata type net describe conovertest, from(http://www.alexisdinno.com/stata)), and for R in the conover.test package.

Being concerned that you found disparate results between the Kruskal-Wallis and rank sum tests is a little like being concerned that you found disparate results between ANOVA on transformed data, and unpaired t tests without pooled variance on untransformed data; I think you are asking the wrong question.