Post-Hoc – How to Report Significant Difference for Kruskal-Wallis Test but Not for Bonferroni Pairwise Comparison

bonferroni, dunn-test, kruskal-wallis test, post-hoc

I performed a Kruskal-Wallis test in SPSS and got a borderline significant P-value (0.042). The post hoc pairwise comparisons with the Bonferroni-Dunn correction then showed no significant pairwise differences. I am not surprised that these tests do not agree.

My question is how should I conclude based on these test results?

Best Answer

You are correct not to be surprised that the two methods give slightly different results near the 5% level. Depending on how many comparisons you made, Bonferroni-based post hoc comparisons might be over-conservative (i.e., too "reluctant" to declare differences).

I would simply say that the main K-W test is barely significant, and give results for the most interesting post hoc comparison(s). If two levels clearly have the largest difference in medians (or other quartiles), or have clearly different boxplots, then IMHO it would be OK to say that those two levels might be considered somehow different.

Of course I can't give a detailed answer without access to your data and the output from your SPSS analyses. However, consider the fictitious 5-point Likert data below, for which the K-W test just barely shows overall differences (5% level) among the three locations.

set.seed(927)
x1 = sample(1:5, 72, replace=TRUE, prob = c(1,1,1,2,3))
x2 = sample(1:5, 72, replace=TRUE, prob = c(0,1,2,3,3))
x3 = sample(1:5, 72, replace=TRUE, prob = c(0,1,1,3,4))
x = c(x1,x2,x3); g = rep(1:3, each=72)

boxplot(x ~ g, horizontal=TRUE, col="skyblue2")

kruskal.test(x~ g)

        Kruskal-Wallis rank sum test

data:  x by g
Kruskal-Wallis chi-squared = 6.1487, df = 2, p-value = 0.04622

Using R, I don't suppose I can do exactly the same post hoc test you did in SPSS. However, a Wilcoxon rank sum test comparing levels 1 and 3 shows no significant difference at the (unadjusted) 5% level.

wilcox.test(x1,x3)$p.val
[1] 0.05116496
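To see how a Bonferroni adjustment acts on all three pairwise comparisons, here is a minimal sketch in base R. It re-creates the fictitious data above and adjusts the rank sum p-values with `p.adjust`. Note the assumption: these are Bonferroni-corrected rank sum tests, not the Dunn procedure that SPSS reports, so the numbers need not match SPSS output.

```r
# Re-create the fictitious data (same seed and weights as above).
set.seed(927)
x1 <- sample(1:5, 72, replace = TRUE, prob = c(1, 1, 1, 2, 3))
x2 <- sample(1:5, 72, replace = TRUE, prob = c(0, 1, 2, 3, 3))
x3 <- sample(1:5, 72, replace = TRUE, prob = c(0, 1, 1, 3, 4))

# Unadjusted p-values for the three pairwise rank sum tests.
# exact = FALSE requests the normal approximation (the data have many ties).
p_raw <- c(
  "1 vs 2" = wilcox.test(x1, x2, exact = FALSE)$p.value,
  "1 vs 3" = wilcox.test(x1, x3, exact = FALSE)$p.value,
  "2 vs 3" = wilcox.test(x2, x3, exact = FALSE)$p.value
)

# Bonferroni multiplies each p-value by the number of comparisons (capped at 1).
p_bonf <- p.adjust(p_raw, method = "bonferroni")

print(round(cbind(unadjusted = p_raw, bonferroni = p_bonf), 4))
```

With three comparisons, an unadjusted p-value has to fall below about 0.0167 to survive the adjustment, which is why a borderline omnibus result can leave no pair significant.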

However, the boxplots for levels 1 and 3 look very different (even though the medians are the same), with many values in level 3 higher than values in level 1.

summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.000   3.667   5.000   5.000 
summary(x3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   4.000   4.000   4.278   5.000   5.000

Moreover, empirical CDF (ECDF) plots of these two levels show that the ECDF for level 3 lies mostly to the right of (thus below) the ECDF for level 1, suggesting stochastic dominance of level 3.

plot(ecdf(x1), col="blue", lwd=2, 
     main="ECDFs for Levels 1 (blue) and 3")
 lines(ecdf(x3), col="brown", lty="dotted", pch="o")

Especially if a difference between levels 1 and 3 had importance for the project at hand, I would not hesitate to mention the apparent differences between these two levels, while stopping short of claiming significance.

Note: I have framed this answer in terms of 'significance' because I took that to be the point of your question. But @FrankHarrell has a point about not viewing significance with 'reverence'. A main hypothesis test with a P-value just barely below 5% is only weakly suggestive of differences. Then by 'logic' it would follow that some two of the three levels may be different, and 1 vs. 3 seems the best candidate. But post hoc testing is not compelled to follow that 'logic'.

If you change the seed in the code that sampled my fictitious data, you may get fictitious data for which the K-W test is not significant at the 5% level: 72 replications per level do not provide good power for the K-W test. (In fact, set.seed(726) leads to a P-value of about 18%.)
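The remark about power can be checked by simulation. The sketch below repeats the sampling many times and records how often the K-W test rejects at the 5% level. The seed and the number of repetitions `B` are arbitrary choices made for this illustration, not part of the original analysis.

```r
# Estimate the power of the K-W test for the three fictitious populations.
set.seed(2024)   # arbitrary seed, only for reproducibility
B <- 1000        # number of simulated data sets (arbitrary)
n <- 72          # replications per level, as in the example
g <- factor(rep(1:3, each = n))

reject <- replicate(B, {
  x <- c(sample(1:5, n, replace = TRUE, prob = c(1, 1, 1, 2, 3)),
         sample(1:5, n, replace = TRUE, prob = c(0, 1, 2, 3, 3)),
         sample(1:5, n, replace = TRUE, prob = c(0, 1, 1, 3, 4)))
  kruskal.test(x, g)$p.value < 0.05
})

# Proportion of simulated data sets in which the K-W test rejects at 5%.
power_hat <- mean(reject)
print(power_hat)
```

Changing `n` in this sketch shows how many replications per level would be needed before a rejection becomes likely rather than a matter of which seed was drawn.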

Related Solutions

Solved – Post-hoc tests after Kruskal-Wallis: Dunn’s test or Bonferroni corrected Mann-Whitney tests

You should use a proper post hoc pairwise test like Dunn's test.^*

If one proceeds by moving from a rejection of Kruskal-Wallis to performing ordinary pair-wise rank sum tests (with or without multiple comparison adjustments), one runs into two problems:

(1) the ranks that the pair-wise rank sum tests use are not the ranks used by the Kruskal-Wallis test (i.e. you are, in effect, pretending to perform post hoc tests, but are actually using different data than was used in the Kruskal-Wallis test to do so); and
(2) the pair-wise rank sum tests do not use the pooled variance implied by the Kruskal-Wallis null hypothesis, whereas Dunn's test preserves it.
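Both points can be made concrete with a small base R sketch of Dunn's z statistics, written from the standard tie-corrected formula rather than taken from any package. The function name `dunn_z` and the example data (the fictitious Likert samples from the first answer) are choices made for this illustration.

```r
# Dunn's pairwise z statistics, computed from the pooled Kruskal-Wallis ranks.
dunn_z <- function(x, g) {
  g <- factor(g)
  N <- length(x)
  r <- rank(x)                              # ranks over ALL groups pooled (mean ranks for ties)
  mean_rank <- sapply(split(r, g), mean)
  n_i <- lengths(split(r, g))
  tie_sizes <- as.numeric(table(x))
  # Pooled variance under the Kruskal-Wallis null, with the tie correction.
  pooled_var <- N * (N + 1) / 12 - sum(tie_sizes^3 - tie_sizes) / (12 * (N - 1))
  pairs <- combn(levels(g), 2)
  z <- apply(pairs, 2, function(p) {
    (mean_rank[[p[1]]] - mean_rank[[p[2]]]) /
      sqrt(pooled_var * (1 / n_i[[p[1]]] + 1 / n_i[[p[2]]]))
  })
  p_raw <- 2 * pnorm(-abs(z))
  data.frame(
    comparison   = paste(pairs[1, ], pairs[2, ], sep = " vs "),
    z            = z,
    p_unadjusted = p_raw,
    p_bonferroni = p.adjust(p_raw, method = "bonferroni")
  )
}

# Example: the fictitious three-level Likert data.
set.seed(927)
x <- c(sample(1:5, 72, replace = TRUE, prob = c(1, 1, 1, 2, 3)),
       sample(1:5, 72, replace = TRUE, prob = c(0, 1, 2, 3, 3)),
       sample(1:5, 72, replace = TRUE, prob = c(0, 1, 1, 3, 4)))
g <- rep(1:3, each = 72)
res <- dunn_z(x, g)
print(res)
```

Every comparison reuses the same pooled ranks and the same pooled variance. When only two groups are supplied, the squared z equals the Kruskal-Wallis statistic, which is one way to check an implementation.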

Of course, as with any omnibus test (e.g., ANOVA, Cochran's $Q$, etc.), post hoc tests following rejection of a Kruskal-Wallis test which have been adjusted for multiple comparisons may fail to reject all pairwise tests for a given family-wise error rate or given false discovery rate corresponding to a given $\alpha$ for the omnibus test.

^* Dunn's test is implemented in Stata in the dunntest package (within Stata type net describe dunntest, from(https://alexisdinno.com/stata)), and in R in the dunn.test package. Caveat: there are a few less well-known post hoc pair-wise tests to follow a rejected Kruskal-Wallis, including Conover-Iman (like Dunn, but based on the t distribution, rather than the z distribution, and strictly more powerful as a post hoc test) which is implemented for Stata in the conovertest package (within Stata type net describe conovertest, from(https://alexisdinno.com/stata)), and for R in the conover.test package, and the Dwass-Steel-Critchlow-Fligner tests.

Friedman Test – Significant Results but Insignificant Post Hoc Comparisons (SPSS)

SPSS Algorithms state that pairwise comparisons after the Friedman test use Dunn's (1964) procedure. I have not read Dunn's original paper, so I can't say whether SPSS follows it correctly. But I have just programmed Friedman's test and its post hoc pairwise comparisons following the SPSS algorithms documentation, and I confirm that there is no bug: my results were identical to the SPSS output that the OP showed in the question. (See my code here).

According to Dunn's approach (as SPSS carries it out), the test statistic is simply the difference between the mean ranks of the two samples (variables) being compared, where the values have first been turned into ranks within cases. (These are the ranks left over from the Friedman test computations, that is, the ranking of the $k$ [$k=3$ in our example data] values within each case, with mean ranks assigned to ties.) The standard error of the statistic is $\sqrt{k(k+1)/(6n)}$. Dividing the test statistic by it yields the standardized statistic $Z$, which is referred to the standard normal distribution to give the 2-sided significance (not yet Bonferroni-corrected).
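The computation just described can be sketched in a few lines of base R. The data below are simulated purely for illustration (they are not the data from the question), and the variable names are hypothetical.

```r
# Hypothetical data: n cases (rows) by k = 3 related variables (columns), with many ties.
set.seed(1)
n <- 100
dat <- matrix(sample(1:3, n * 3, replace = TRUE), nrow = n,
              dimnames = list(NULL, c("V1", "V2", "V3")))

k <- ncol(dat)
# Rank the k values within each case (mean ranks for ties), as Friedman's test does.
within_ranks <- t(apply(dat, 1, rank))
mean_rank <- colMeans(within_ranks)

se <- sqrt(k * (k + 1) / (6 * n))            # standard error formula quoted above
pairs <- combn(colnames(dat), 2)
stat <- unname(mean_rank[pairs[1, ]] - mean_rank[pairs[2, ]])
z <- stat / se
p_raw <- 2 * pnorm(-abs(z))                  # 2-sided, not yet Bonferroni-corrected

result <- data.frame(
  pair         = paste(pairs[1, ], pairs[2, ], sep = "-"),
  test_stat    = stat,
  z            = z,
  p_2sided     = p_raw,
  p_bonferroni = p.adjust(p_raw, method = "bonferroni")
)
print(result)
```

Note that the standard error depends only on $k$ and $n$: neither the number of ties nor the particular pair being compared enters it.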

This comparison test looks very conservative. It failed to flag the pair V1-V2 as significant (Z=1.838, p=.066) even though the omnibus Friedman test is strongly significant (p=.002). In contrast, the Sign test for the pair V1-V2 (it gives the same result whether you perform it on the raw values or on the ranks left over from Friedman) has Z=3.575, p=.0004.

One reason the SPSS "Dunn's approach" is quite conservative is its st. error formula accounting for all the $k$, not 2, variables.

Another reason why it is so much less powerful than the Sign test is that it uses all $n$ cases, including those with ties, while the Sign test discards cases with ties; and there are many cases with ties in our data. The problem of power in connection with the treatment of ties in tests such as the Sign test was noted, for example, in this Q/A.

I took V1 and V2 and, for cases with ties, untied them at random (by adding negative or positive noise), then computed the Sign test (now based on all $n$ cases, of course). 500 such trials gave a mean Z=1.927, which is far from Z=3.575 and much closer to the conservative Dunn's Z=1.838 observed above.
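The untying experiment can be approximated from summary counts alone. In the sketch below, the counts of positive and negative differences (76 and 37) are back-calculated from the figures reported further down in this answer for V1 vs V2 (400 cases, 287 ties, mean ranks 1.54875 and 1.45125); they are an assumption, not the original data. The Z formula is a continuity-corrected normal approximation, which is also an assumption about how the reported Sign test Z was obtained.

```r
# Counts back-calculated from the reported summary figures (an assumption).
n_tied <- 287
n_pos  <- 76    # cases with V1 > V2 (derived)
n_neg  <- 37    # cases with V1 < V2 (derived)

# Sign test Z with continuity correction, using only the untied cases supplied.
sign_z <- function(pos, neg) max(abs(pos - neg) - 1, 0) / sqrt(pos + neg)

z_sign <- sign_z(n_pos, n_neg)   # ties discarded; about 3.575
print(z_sign)

# Break each tie by a fair coin flip, then redo the Sign test on all 400 cases.
set.seed(1)                      # arbitrary seed
z_untied <- replicate(500, {
  extra_pos <- rbinom(1, n_tied, 0.5)
  sign_z(n_pos + extra_pos, n_neg + (n_tied - extra_pos))
})
print(mean(z_untied))
```

Randomly broken ties add noise to the difference in counts while inflating the denominator from 113 to 400 cases, so Z shrinks sharply even though the informative cases are unchanged.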

I am dissatisfied with SPSS's "Dunn's" pairwise comparisons because they are too conservative (weak). We expect that if an omnibus test is significant, post hoc tests will confirm it often, if not always. In our example, even the p-value without Bonferroni correction could not support the omnibus conclusion.

Is SPSS at all correct in adopting the "Dunn's approach" (originally proposed for Kruskal-Wallis; see also this Q/A) for Friedman post-hoc testing? I can't say, being hardly an expert in multiple comparisons. I would encourage somebody who knows it to comment or post a really helpful answer on this thread.

P.S. I'm quite aware that, while the Friedman test can be seen as an extension of the Sign test from 2 to $k$ samples (variables), a pairwise post hoc test after Friedman is not and should not be exactly the Sign test. Nor would it be the Wilcoxon paired-sample test. The "Dunn's approach" (if adapted to the paired-sample situation) looks plausible as a post hoc test because it compares, without further ranking, the "horizontal" ranks obtained in the Friedman test, which reflect all $k$ samples. What bothered me, though, was that the approach appeared overconservative in the example of the post.

Later Addition. To me, Dunn's approach as implemented after Friedman's test in SPSS is incorrect. It does not adjust for ties in the same way as the parent omnibus test (Friedman) does. In fact, it does not adjust for ties at all, although it should. (The handling of ties is touched on earlier in this answer.)

The formula of Friedman's test statistic (explained in SPSS Algorithms) is $$\chi^2= \frac{[12/(nk(k+1))]\sum^k C^2-3n(k+1)}{1-\Sigma T/[nk(k^2-1)]}$$

The denominator of the formula contains the adjustment for ties. If $k=2$ then quantity $\Sigma T/[nk(k^2-1)]$ is the proportion of cases in which the two variables are equal (tied).

Consider Friedman test performed with our variables V1 and V2 ($k=2$). The proportion of cases with ties is 287/400=.7175 and the test statistic is 13.460, df=1 with significance p=.00024. But the "Dunn's" comparison computed following SPSS formulas will be

Sample1  Sample2  MeanRank1 MeanRank2 TestStat  StError   Z    Sig2side  AdjSig
  V1       V2      1.54875   1.45125   .0975     .0500  1.9500  .05118  .05118

Nonsignificant. Why? No proper (Friedman style) adjustment for ties was done.
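The figures above can be reproduced from the summary numbers alone. This is a minimal check in R that uses only the values quoted in this answer ($n=400$, $k=2$, the two mean ranks, and 287 tied cases); the variable names are chosen for this illustration.

```r
n <- 400
k <- 2
mean_rank <- c(V1 = 1.54875, V2 = 1.45125)
prop_tied <- 287 / 400           # equals sum(T) / [n k (k^2 - 1)] when k = 2

# Friedman statistic: numerator first, then divide by the tie adjustment.
C <- n * mean_rank               # rank sums per variable
chi2_uncorrected <- 12 / (n * k * (k + 1)) * sum(C^2) - 3 * n * (k + 1)
chi2_friedman <- chi2_uncorrected / (1 - prop_tied)          # about 13.46
p_friedman <- pchisq(chi2_friedman, df = k - 1, lower.tail = FALSE)

# SPSS-style "Dunn" comparison: no tie adjustment anywhere.
se <- sqrt(k * (k + 1) / (6 * n))                            # 0.05
z_dunn <- unname(mean_rank["V1"] - mean_rank["V2"]) / se     # 1.95
p_dunn <- 2 * pnorm(-abs(z_dunn))

print(round(c(chi2_uncorrected = chi2_uncorrected, z_dunn_squared = z_dunn^2,
              chi2_friedman = chi2_friedman, p_friedman = p_friedman,
              z_dunn = z_dunn, p_dunn = p_dunn), 5))
```

In this $k=2$ case the squared "Dunn" Z coincides with the numerator of the Friedman statistic, so the two procedures differ only by the tie-adjustment factor $1/(1-0.7175)$.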

When the data contain only $k=2$ samples, a correct post hoc pairwise comparison test must give the same result (statistic and p-value) as the omnibus test. This property is in fact what shows that the post hoc test corresponds (is isomorphic) to the parent omnibus test. It does hold for the Kruskal-Wallis test and Dunn's test: just program it following SPSS Algorithms and test with V1 and V2 as two independent groups, and you'll get the same p=.0153 both for KW and for Dunn. But we saw that no such equivalence exists between the Friedman test and the "Dunn's approach" post-Friedman comparison test.

Conclusion. The post hoc multiple comparison test performed by SPSS (version 22 and earlier) after Friedman's test is defective. Maybe it is correct when there are no ties, but I don't know. The post hoc test does not treat ties the way Friedman's test does (although it should). I cannot say anything about the standard error formula they use, sqrt[k*(k+1)/(6n)]: it was derived from the discrete uniform distribution, but they didn't write how; is it correct? Either the "Dunn's test approach" was adapted to Friedman inadequately by SPSS, or Dunn's test cannot be adapted to Friedman at all.
