You should use a proper post hoc pairwise test like Dunn's test.*
If one proceeds by moving from a rejection of Kruskal-Wallis to performing ordinary pair-wise rank sum tests (with or without multiple comparison adjustments), one runs into two problems:
1. the ranks that the pair-wise rank sum tests use are not the ranks used by the Kruskal-Wallis test (i.e. you are, in effect, pretending to perform post hoc tests, but are actually using different data than was used in the Kruskal-Wallis test to do so); and
2. the pair-wise rank sum tests do not use the pooled variance implied by the Kruskal-Wallis null hypothesis, whereas Dunn's test preserves it.
Of course, as with any omnibus test (e.g., ANOVA, Cochran's $Q$, etc.), post hoc tests following rejection of a Kruskal-Wallis test which have been adjusted for multiple comparisons may fail to reject all pairwise tests for a given family-wise error rate or given false discovery rate corresponding to a given $\alpha$ for the omnibus test.
* Dunn's test is implemented in Stata in the dunntest package (within Stata type net describe dunntest, from(https://alexisdinno.com/stata)), and in R in the dunn.test package. Caveat: there are a few less well-known post hoc pair-wise tests to follow a rejected Kruskal-Wallis, including Conover-Iman (like Dunn, but based on the t distribution rather than the z distribution, and strictly more powerful as a post hoc test), which is implemented for Stata in the conovertest package (within Stata type net describe conovertest, from(https://alexisdinno.com/stata)) and for R in the conover.test package, and the Dwass-Steel-Critchlow-Fligner tests.
The SPSS Algorithms documentation states that pairwise comparisons after the Friedman test use Dunn's (1964) procedure. I have not read Dunn's original paper, so I can't say whether SPSS follows it correctly, but I have just programmed Friedman's test and its post hoc pairwise comparisons following that SPSS documentation, and I can confirm that there is no bug: my results were identical to the SPSS output that the OP showed in the question. (See my code here).
According to Dunn's approach (as SPSS carries it out), the test statistic is simply the difference between the mean ranks of the two samples (variables) being compared, where the values have first been turned into ranks within cases. (These are the ranks left over from the Friedman test computations, that is, the ranking of the $k$ values within each case [$k=3$ in our example data], with mean ranks assigned for ties.) The standard error of the statistic is $\sqrt{k(k+1)/(6n)}$. Dividing the test statistic by it yields the standardized statistic $Z$, which is referred to the standard normal distribution to give the 2-sided significance (not yet Bonferroni-corrected).
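The computation described above can be sketched in a few lines of Python. This is only an illustration of the formulas as quoted in this answer, not SPSS's actual code; the function names and the tiny data set are hypothetical.

```python
import math

def average_ranks(values):
    """Rank the values of one case (1 = smallest); tied values get their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def dunn_after_friedman(data, a, b):
    """Z and 2-sided p for columns a and b, using SE = sqrt(k(k+1)/(6n))."""
    n, k = len(data), len(data[0])
    ranked = [average_ranks(row) for row in data]   # within-case (Friedman) ranks
    mean_a = sum(r[a] for r in ranked) / n
    mean_b = sum(r[b] for r in ranked) / n
    se = math.sqrt(k * (k + 1) / (6 * n))
    z = (mean_a - mean_b) / se
    p = math.erfc(abs(z) / math.sqrt(2))            # 2-sided normal tail, uncorrected
    return z, p

# Hypothetical data: 5 cases (rows) measured on k = 3 variables (columns).
example = [[3, 1, 2], [2, 2, 1], [3, 2, 1], [1, 1, 1], [3, 1, 1]]
z, p = dunn_after_friedman(example, 0, 1)
```

Note that the standard error depends only on $k$ and $n$, so ties in the data change the mean ranks but never the denominator.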
This comparison test looks very conservative. It failed to declare the pair V1-V2 significant (Z=1.838, p=.066), even though the omnibus Friedman test is strongly significant (p=.002). In contrast, the Sign test for the pair V1-V2 (it is the same whether you perform it on the raw values or on the ranks left from Friedman) has Z=3.575, p=.0004.
One reason the SPSS "Dunn's approach" is quite conservative is its st. error formula accounting for all the $k$, not 2, variables.
Another reason why it is so much less powerful than the Sign test is that it is based on all $n$ cases, including those with ties, while the Sign test discards cases with ties; and there are many cases with ties in our data. The problem of power in connection with the treatment of ties in tests such as the Sign test was observed, for example, in this Q/A.
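To make the role of the discarded ties concrete, here is a small sketch of the normal-approximation Sign test. The counts are not stated directly in the post; they are inferred from figures given further down (400 cases, 287 of them tied, mean ranks 1.54875 and 1.45125), which imply 76 cases with V1 > V2 and 37 with V1 < V2. The continuity correction is an assumption on my part; with it, the sketch happens to reproduce the Z quoted above.

```python
import math

def sign_test_z(n_pos, n_neg, continuity=True):
    """Normal-approximation Sign test on the untied cases only."""
    m = n_pos + n_neg                 # tied cases are assumed to be already discarded
    diff = abs(n_pos - n_neg)
    if continuity:
        diff = max(diff - 1, 0)       # continuity correction (an assumption here)
    z = diff / math.sqrt(m)
    p = math.erfc(z / math.sqrt(2))   # 2-sided normal tail probability
    return z, p

# Counts inferred from the reported mean ranks and tie proportion (see lead-in).
z, p = sign_test_z(76, 37)
print(round(z, 3))  # 3.575
```

Only 113 of the 400 cases enter the denominator, which is why this Z is so much larger than the Dunn-style Z computed over all cases.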
I took V1 and V2 and, for cases with ties, untied them at random (by adding negative or positive noise), and computed the Sign test (now based on all $n$ cases, of course). 500 such trials gave me a mean Z=1.927, which is now far from Z=3.575 and much closer, in the direction of conservatism, to the observed Dunn's Z=1.838.
I am dissatisfied with SPSS' "Dunn's" pairwise comparisons because they are too conservative/weak. We expect that if an omnibus test is significant, post hoc tests will confirm it often, if not always. In our example, even the Bonferroni-uncorrected p-value could not support the omnibus conclusion.
Is SPSS at all correct in adopting the "Dunn's approach" (originally proposed for Kruskal-Wallis; see also this Q/A) for Friedman post-hoc testing? I can't say, being hardly an expert in multiple comparisons. I would encourage somebody who knows it to comment or post a really helpful answer on this thread.
P.S. I'm quite aware that, while the Friedman test can be seen as an extension of the Sign test from 2 to $k$ samples (variables), a pairwise post hoc test after Friedman is not and should not be exactly the Sign test. Nor would it be the Wilcoxon paired-sample test. The "Dunn's approach" (if adapted to the paired-sample situation) looks plausible as a post hoc test because it compares, without further ranking, the "horizontal" ranks obtained in the Friedman test, which reflect all $k$ samples. What bothered me, though, was that the approach appeared overconservative in the example of the post.
Later Addition. To me, Dunn's approach as it is implemented after Friedman's test in SPSS is incorrect. It does not adjust for ties in the same fashion as the parent omnibus test (Friedman) does. Actually, it does not adjust for ties at all, while it should. (The issue of handling ties is touched on earlier in this answer.)
The formula of Friedman's test statistic (explained in SPSS Algorithms) is
$$\chi^2= \frac{[12/(nk(k+1))]\sum^k C^2-3n(k+1)}{1-\Sigma T/[nk(k^2-1)]}$$
The denominator of the formula contains the adjustment for ties. If $k=2$ then quantity $\Sigma T/[nk(k^2-1)]$ is the proportion of cases in which the two variables are equal (tied).
Consider the Friedman test performed on our variables V1 and V2 ($k=2$). The proportion of cases with ties is 287/400=.7175, and the test statistic is 13.460, df=1, with significance p=.00024. But the "Dunn's" comparison computed following the SPSS formulas will be
Sample1  Sample2  MeanRank1  MeanRank2  TestStat  StError  Z       Sig2side  AdjSig
V1       V2       1.54875    1.45125    .0975     .0500    1.9500  .05118    .05118
Nonsignificant. Why? No proper (Friedman style) adjustment for ties was done.
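The two results can be re-derived from the summary figures alone. The sketch below assumes that $C$ in the Friedman formula denotes the rank sums of the variables (mean rank times $n$) and that each tied pair contributes $t^3-t=6$ to $\Sigma T$; both assumptions reproduce the numbers quoted in this answer.

```python
import math

n, k = 400, 2
n_tied = 287
mean_rank_v1, mean_rank_v2 = 1.54875, 1.45125

# Rank sums C_j (assumed: mean rank times n)
c = [mean_rank_v1 * n, mean_rank_v2 * n]

# Ties term: each tied pair has t = 2, so it adds t^3 - t = 6 to sum T
sum_t = n_tied * (2**3 - 2)
tie_term = sum_t / (n * k * (k**2 - 1))          # equals 287/400 = .7175 for k = 2

# Friedman statistic with the ties adjustment in the denominator
numerator = 12 / (n * k * (k + 1)) * sum(cj**2 for cj in c) - 3 * n * (k + 1)
chi2 = numerator / (1 - tie_term)
p_friedman = math.erfc(math.sqrt(chi2 / 2))      # chi-square upper tail, df = 1

# "Dunn's" comparison as described for SPSS: no ties adjustment anywhere
se = math.sqrt(k * (k + 1) / (6 * n))
z = (mean_rank_v1 - mean_rank_v2) / se
p_dunn = math.erfc(abs(z) / math.sqrt(2))        # 2-sided normal tail

print(round(chi2, 3), round(z, 2))  # 13.46 1.95
```

In this example $Z^2 = 1.95^2 = 3.8025$ is exactly the numerator of the Friedman statistic, so the two tests differ only by the factor $1-.7175$ in the denominator, i.e. by the ties adjustment.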
In the presence of only $k=2$ samples in the data, a correct post hoc pairwise comparison test must give the same result (statistic and p-value) as the omnibus test; this is in fact the property that shows the post hoc test corresponds (is isomorphic) to the parent omnibus test. It is indeed so with the Kruskal-Wallis test and Dunn's test: just program them following SPSS Algorithms and test with V1 and V2 as two independent groups, and you'll get the same p=.0153 for both KW and Dunn. But we saw that a similar equivalence is absent between the Friedman test and the "Dunn's approach" post-Friedman comparison test.
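The Kruskal-Wallis/Dunn equivalence for two groups can be checked with a short sketch. It uses the usual textbook forms of the tie-corrected Kruskal-Wallis statistic and of Dunn's tie-corrected variance; I have not checked them line by line against SPSS Algorithms. The two groups are hypothetical, since the raw V1 and V2 values are not shown here, so the p=.0153 above is not reproduced.

```python
import math
from collections import Counter

def pooled_average_ranks(values):
    """Mean ranks over the pooled sample, plus the count of each distinct value."""
    counts = Counter(values)
    rank_of, start = {}, 1
    for v in sorted(counts):
        t = counts[v]
        rank_of[v] = start + (t - 1) / 2   # mean of ranks start .. start+t-1
        start += t
    return [rank_of[v] for v in values], counts

def kw_and_dunn_two_groups(g1, g2):
    ranks, counts = pooled_average_ranks(g1 + g2)
    n1, n2 = len(g1), len(g2)
    N = n1 + n2
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    sum_t = sum(t**3 - t for t in counts.values())
    # Kruskal-Wallis H with the ties correction
    h = 12 / (N * (N + 1)) * (r1**2 / n1 + r2**2 / n2) - 3 * (N + 1)
    h /= 1 - sum_t / (N**3 - N)
    # Dunn's z with the pooled, tie-corrected variance
    var = (N * (N + 1) / 12 - sum_t / (12 * (N - 1))) * (1 / n1 + 1 / n2)
    z = (r1 / n1 - r2 / n2) / math.sqrt(var)
    return h, z

# Hypothetical 5-point ratings for two independent groups
h, z = kw_and_dunn_two_groups([1, 2, 2, 3, 3, 3, 4, 5], [2, 3, 4, 4, 4, 5, 5, 5])
p_kw = math.erfc(math.sqrt(h / 2))           # chi-square upper tail, df = 1
p_dunn = math.erfc(abs(z) / math.sqrt(2))    # 2-sided normal tail
```

Here $z^2 = H$ up to rounding error, so the two p-values coincide. Both statistics apply the same ties correction, which is exactly what the post-Friedman comparison lacks.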
Conclusion. The post hoc multiple comparison test performed by SPSS (version 22 and earlier) after Friedman's test is defective. Maybe it is correct when there are no ties, but I don't know. The post hoc test does not treat ties the way Friedman's test does (while it should). I cannot say anything about the standard error formula they use, $\sqrt{k(k+1)/(6n)}$: it was derived from the discrete uniform distribution, but they didn't write how; is it correct? Either the "Dunn's test approach" was adapted to Friedman inadequately by SPSS, or Dunn's test cannot be adapted to Friedman at all.
Best Answer
You are correct not to be surprised that the two methods give slightly different results near the 5% level. Depending on how many comparisons you made, Bonferroni-based post hoc comparisons might be over-conservative (i.e., too "reluctant" to declare differences).
I would simply say that the main K-W test is barely significant, and give results for the most interesting post hoc comparison(s). If there are two levels clearly with the largest difference in medians (or other quartiles) or with clearly different boxplots, then IMHO it would be OK to say that those two levels might be considered somehow different.
Of course I can't give a detailed answer without access to your data and the output from your SPSS analyses. However, consider the fictitious Likert=5 data below, for which the K-W test just barely shows overall differences (5% level) in the three locations.
Using R, I don't suppose I can do exactly the same post hoc test you did in SPSS. However, a Wilcoxon rank sum test comparing levels 1 and 3 shows no significant difference at the (unadjusted) 5% level.
However, the boxplots for levels 1 and 3 look very different (even though the medians are the same), with many values in level 3 higher than values in level 1.
Moreover, empirical CDF (ECDF) plots of these two levels show that the ECDF for level 3 lies mostly to the right of (thus below) the ECDF for level 1, suggesting stochastic dominance of level 3.
Especially if a difference between levels 1 and 3 had importance for the project at hand, I would not hesitate to mention the apparent differences between these two levels---stopping short of claiming significance.
Note: I have framed this answer in terms of 'significance' because I took that to be the point of your question. But @FrankHarrell has a point about not viewing significance with 'reverence'. The main hypothesis with P-value just barely below 5% is only weakly suggestive of differences. Then by 'logic' it would follow that some two of the three levels may be different, and 1 vs. 3 seems the best candidate. But post hoc testing is not compelled to follow that 'logic'.
If you change the seed in the code that sampled my fictitious data, you may get fictitious data for which the K-W test is not significant at the 5% level: 72 replications per level do not provide good power for the K-W test. (In fact, set.seed(726) leads to a P-value of about 18%.)