Post-Hoc – How to Report Significant Difference for Kruskal-Wallis Test but Not for Bonferroni Pairwise Comparison

Tags: bonferroni, dunn-test, kruskal-wallis-test, post-hoc

I performed a Kruskal-Wallis test in SPSS and got a borderline significant P-value (0.042). Then, post hoc pairwise comparisons with the Bonferroni-Dunn correction showed no significant pairwise difference. I am not surprised that these tests did not agree.

My question is how should I conclude based on these test results?

Best Answer

You are correct not to be surprised that the two methods give slightly different results near the 5% level. Depending on how many comparisons you made, Bonferroni-based post hoc comparisons can be overly conservative (i.e., too "reluctant" to declare differences).
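To see why, note that with three groups there are choose(3, 2) = 3 pairwise comparisons, so a Bonferroni correction tests each pair at level 0.05/3 ≈ 0.0167, or equivalently multiplies each raw P-value by 3. A quick sketch in R (the three raw P-values below are made up purely for illustration):

```r
# Bonferroni correction for all pairwise comparisons among k groups
k <- 3                 # number of groups
m <- choose(k, 2)      # number of pairwise comparisons: 3
0.05 / m               # per-comparison level: about 0.0167

# Equivalently, multiply each raw P-value by m (capped at 1):
p.adjust(c(0.02, 0.04, 0.30), method = "bonferroni")  # about 0.06 0.12 0.90
```

So a pair whose raw P-value is 0.02, comfortably below 5%, no longer clears the adjusted bar.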

I would simply say that the main K-W test is barely significant, and give results for the most interesting post hoc comparison(s). If two levels clearly have the largest difference in medians (or other quartiles), or clearly different boxplots, then IMHO it would be OK to say that those two levels might be considered somewhat different.


Of course, I can't give a detailed answer without access to your data and the output of your SPSS analyses. However, consider the fictitious 5-point Likert data below, for which the K-W test just barely shows overall differences (at the 5% level) among the three locations.

set.seed(927)
x1 = sample(1:5, 72, replace=TRUE, prob=c(1,1,1,2,3))
x2 = sample(1:5, 72, replace=TRUE, prob=c(0,1,2,3,3))
x3 = sample(1:5, 72, replace=TRUE, prob=c(0,1,1,3,4))
x = c(x1,x2,x3); g = rep(1:3, each=72)

boxplot(x ~ g, horizontal=TRUE, col="skyblue2")

[Boxplots of the three simulated levels]

kruskal.test(x ~ g)

        Kruskal-Wallis rank sum test

data:  x by g
Kruskal-Wallis chi-squared = 6.1487, df = 2, p-value = 0.04622

Using R, I can't run exactly the same post hoc test you did in SPSS. However, a Wilcoxon rank-sum test comparing levels 1 and 3 shows no significant difference at the (unadjusted) 5% level.

wilcox.test(x1,x3)$p.val
[1] 0.05116496
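For what it's worth, base R can at least run all three pairwise Wilcoxon comparisons with a Bonferroni adjustment in one call via pairwise.wilcox.test. This is not the same procedure as SPSS's Dunn test, but it is in the same spirit. (Tied Likert data force the normal approximation, so the exactness warnings are suppressed.)

```r
# Same fictitious data as above
set.seed(927)
x1 = sample(1:5, 72, replace=TRUE, prob=c(1,1,1,2,3))
x2 = sample(1:5, 72, replace=TRUE, prob=c(0,1,2,3,3))
x3 = sample(1:5, 72, replace=TRUE, prob=c(0,1,1,3,4))
x = c(x1,x2,x3); g = rep(1:3, each=72)

# All three pairwise Wilcoxon rank-sum tests, Bonferroni-adjusted
suppressWarnings(pairwise.wilcox.test(x, g, p.adjust.method = "bonferroni"))
```

Because the adjustment multiplies each raw P-value by 3, the 1-vs-3 comparison (raw P about 0.051) moves even further from significance.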

However, the boxplots for levels 1 and 3 look very different (even though the medians are the same), with many values in level 3 higher than values in level 1.

summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.000   3.667   5.000   5.000 
summary(x3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   4.000   4.000   4.278   5.000   5.000 

Moreover, empirical CDF (ECDF) plots of these two levels show that the ECDF for level 3 lies mostly to the right of (thus below) the ECDF for level 1, suggesting stochastic dominance of level 3.

plot(ecdf(x1), col="blue", lwd=2,
     main="ECDFs for Levels 1 (blue) and 3")
lines(ecdf(x3), col="brown", lty="dotted", pch="o")

[ECDF plots for levels 1 (blue) and 3 (brown)]
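One way to quantify that dominance numerically (my illustration, not part of the original SPSS output) is the probability of superiority: the chance that a randomly chosen level-3 response exceeds a randomly chosen level-1 response, counting ties as one half. This is the quantity the Wilcoxon rank-sum statistic estimates; a value of 0.5 would mean no difference.

```r
# Regenerate the same fictitious data (x2 must be drawn as well,
# to keep the RNG state identical to the earlier code).
set.seed(927)
x1 = sample(1:5, 72, replace=TRUE, prob=c(1,1,1,2,3))
x2 = sample(1:5, 72, replace=TRUE, prob=c(0,1,2,3,3))
x3 = sample(1:5, 72, replace=TRUE, prob=c(0,1,1,3,4))

# P(X3 > X1) + 0.5 * P(X3 == X1), over all 72*72 pairs
p_sup = mean(outer(x3, x1, ">")) + 0.5 * mean(outer(x3, x1, "=="))
p_sup    # noticeably above 0.5, consistent with the ECDF plot
```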

Especially if a difference between levels 1 and 3 had importance for the project at hand, I would not hesitate to mention the apparent differences between these two levels, stopping short of claiming significance.

Note: I have framed this answer in terms of 'significance' because I took that to be the point of your question. But @FrankHarrell has a point about not viewing significance with 'reverence'. A main hypothesis test with a P-value just barely below 5% is only weakly suggestive of differences. By that 'logic' it would follow that some two of the three levels may differ, and 1 vs. 3 seems the best candidate. But post hoc testing is not compelled to follow that 'logic'.

If you change the seed in the code that sampled my fictitious data, you may get fictitious data for which the K-W test is not significant at the 5% level: 72 replications per level do not provide good power for the K-W test. (In fact, set.seed(726) leads to a P-value of about 18%.)
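To make that power remark concrete, here is a quick Monte Carlo sketch (my addition, with an arbitrary simulation seed) estimating how often the K-W test at the 5% level rejects when data are repeatedly drawn from these same three Likert distributions with 72 observations per level:

```r
# Estimate K-W power by simulating many datasets from the same
# three discrete distributions used above (72 observations each).
set.seed(2024)   # arbitrary seed for the simulation itself
power_est <- mean(replicate(2000, {
  x1 <- sample(1:5, 72, replace=TRUE, prob=c(1,1,1,2,3))
  x2 <- sample(1:5, 72, replace=TRUE, prob=c(0,1,2,3,3))
  x3 <- sample(1:5, 72, replace=TRUE, prob=c(0,1,1,3,4))
  kruskal.test(list(x1, x2, x3))$p.value < 0.05
}))
power_est   # estimated rejection rate at the 5% level
```

Whatever the exact estimate on your machine, it will fall well short of certainty, which is why a different seed can easily flip the overall test to nonsignificance.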