The SPSS Algorithms documentation states that pairwise comparisons after the Friedman test follow Dunn's (1964) procedure. I have not read Dunn's original paper, so I can't say whether SPSS follows it correctly. However, I have just programmed Friedman's test and its post hoc pairwise comparisons following that SPSS Algorithms documentation, and I can confirm that there is no bug: my results were identical to the SPSS output that the OP showed in the question. (See my code here).
In Dunn's approach (as SPSS carries it out), the test statistic is simply the difference between the mean ranks of the two samples (variables) being compared. The ranks are those left over from the Friedman test computations: the $k$ values ($k=3$ in our example data) are ranked within each case, with mean ranks assigned to ties. The standard error of the statistic is $\sqrt{k(k+1)/(6n)}$. Dividing the test statistic by it yields the standardized statistic $Z$, which is referred to the standard normal distribution to give the 2-sided significance (not yet Bonferroni-corrected).
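The procedure just described can be sketched in a few lines of Python. This is only an illustration of the steps as this answer describes them, not SPSS's actual code; the function names and the toy data are made up for the example.

```python
import math

def rank_within_case(values):
    """Rank one case's k values from 1 to k, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks

def dunn_after_friedman(cases, a, b):
    """Compare variables a and b (column indices) using the Friedman ranks.

    Returns (test statistic, standard error, Z, two-sided p), with no
    Bonferroni correction and no adjustment for ties.
    """
    n = len(cases)
    k = len(cases[0])
    ranked = [rank_within_case(row) for row in cases]
    mean_a = sum(r[a] for r in ranked) / n
    mean_b = sum(r[b] for r in ranked) / n
    stat = mean_a - mean_b
    se = math.sqrt(k * (k + 1) / (6 * n))
    z = stat / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal p-value
    return stat, se, z, p

# Hypothetical toy data: 6 cases, k = 3 variables (not the data from the question).
toy = [(1, 2, 3), (2, 2, 3), (1, 3, 2), (3, 3, 3), (1, 2, 2), (2, 1, 3)]
stat, se, z, p = dunn_after_friedman(toy, 0, 2)
print(stat, se, z, p)
```

Note that the standard error depends only on $k$ and $n$, so it is the same for every pair and is unaffected by how many ties the data contain.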
This comparison test looks very conservative. It failed to flag the pair V1-V2 as significant (Z=1.838, p=.066) even though the omnibus Friedman test is strongly significant (p=.002). In contrast, the Sign test for the pair V1-V2 (it gives the same result whether you perform it on the raw values or on the ranks left from Friedman) has Z=3.575, p=.0004.
One reason the SPSS "Dunn's approach" is quite conservative is that its standard error formula accounts for all $k$ variables, not just the 2 being compared.
Another reason why it is so much less powerful than the Sign test is that it is based on all $n$ cases, including those with ties, while the Sign test discards cases with ties; and there are many cases with ties in our data. The problem of power in connection with the treatment of ties in tests such as the Sign test was noted, for example, in this Q/A.
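For reference, here is a minimal sketch of the normal-approximation Sign test with continuity correction, which drops tied cases before testing. The counts 76 and 37 are not stated anywhere in this answer; they are back-calculated from the $k=2$ mean ranks (1.54875, 1.45125) and the tie count (287 of 400) reported further down, so treat them as an assumption. The function name is hypothetical.

```python
import math

def sign_test_z(n_pos, n_neg, continuity=True):
    """Normal-approximation Sign test; tied cases are assumed already discarded."""
    n = n_pos + n_neg
    diff = abs(n_pos - n_neg)
    if continuity:
        diff = max(diff - 1, 0)  # same as (|n_pos - n/2| - 0.5) / (sqrt(n) / 2)
    z = diff / math.sqrt(n)
    p = math.erfc(z / math.sqrt(2))  # two-sided p-value
    return z, p

# Assumed counts: 76 cases with V1 > V2, 37 with V1 < V2, 287 ties dropped.
z, p = sign_test_z(76, 37)
print(round(z, 3), round(p, 4))
```

With these assumed counts the sketch gives a Z that rounds to the 3.575 quoted above, which suggests the quoted Sign test used the continuity correction.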
I took V1 and V2 and, for cases with ties, untied them at random (by adding negative or positive noise), and computed the Sign test (now based on all $n$ cases, of course). 500 such trials gave me a mean Z=1.927, which is now far from Z=3.575 and much closer, in the direction of conservatism, to the observed Dunn's Z=1.838.
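A rough way to re-create this experiment without the raw data is to work from counts alone. The sketch below assumes 76 cases with V1 > V2, 37 with V1 < V2 and 287 ties (back-calculated from figures reported later in this answer, not given directly), breaks each tie with a fair coin flip, and averages the continuity-corrected Sign test Z over the trials. Whether the original experiment used the continuity correction is not stated, so this is an approximation of it, and the function name is hypothetical.

```python
import math
import random

def mean_z_after_random_untying(n_pos, n_neg, n_tied, trials=500, seed=0):
    """Average Sign test Z when every tied case is untied by a fair coin flip."""
    rng = random.Random(seed)
    n = n_pos + n_neg + n_tied  # all cases now enter the test
    total = 0.0
    for _ in range(trials):
        extra_pos = sum(rng.random() < 0.5 for _ in range(n_tied))
        pos = n_pos + extra_pos
        neg = n - pos
        total += max(abs(pos - neg) - 1, 0) / math.sqrt(n)
    return total / trials

# Assumed counts, see the lead-in above.
print(mean_z_after_random_untying(76, 37, 287))
```

The random untying adds pure noise to the count difference while inflating $n$ from 113 to 400, which is why the average Z drops to roughly the level of the Dunn-style statistic.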
I am dissatisfied with SPSS' "Dunn's" pairwise comparisons because they are too conservative/weak. We expect that if an omnibus test is significant, post hoc tests will confirm it often, if not always. In our example, even the Bonferroni-uncorrected p-value could not support the omnibus conclusion.
Is SPSS at all correct in adopting the "Dunn's approach" (originally proposed for Kruskal-Wallis; see also this Q/A) for Friedman post hoc testing? I can't say, being hardly an expert in multiple comparisons. I would encourage anyone who knows the subject to comment or post a really helpful answer on this thread.
P.S. I'm quite aware that, while the Friedman test can be seen as an extension of the Sign test from 2 to $k$ samples (variables), a pairwise post hoc test after Friedman is not and should not be exactly the Sign test. Nor would it be the Wilcoxon paired-sample test. The "Dunn's approach" (if adapted to the paired-sample situation) looks plausible as a post hoc test because it compares, without further ranking, the "horizontal" ranks obtained in the Friedman test, which reflect all $k$ samples. What bothered me, though, was that the approach appeared overconservative in the example of the post.
Later Addition. To me, Dunn's approach as it is implemented after Friedman's test in SPSS is incorrect. It does not adjust for ties in the same way as the parent omnibus test (Friedman) does. Actually, it does not adjust for ties at all, while it should. (The issue of handling ties is touched on earlier in this answer.)
The formula of Friedman's test statistic (explained in SPSS Algorithms) is
$$\chi^2= \frac{[12/(nk(k+1))]\sum^k C^2-3n(k+1)}{1-\Sigma T/[nk(k^2-1)]}$$
The denominator of the formula contains the adjustment for ties. If $k=2$ then quantity $\Sigma T/[nk(k^2-1)]$ is the proportion of cases in which the two variables are equal (tied).
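To see why the $k=2$ claim holds, assume the usual definition $T=t^3-t$ for a set of $t$ tied values within a case (that definition is not spelled out in this answer, so take it as an assumption). With $k=2$ a tied case can only have $t=2$, contributing $T=6$, and an untied case contributes nothing. Writing $m$ for the number of tied cases (a symbol introduced here for illustration):

$$\frac{\Sigma T}{nk(k^2-1)}=\frac{6m}{n\cdot 2\cdot 3}=\frac{m}{n}$$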
Consider the Friedman test performed on our variables V1 and V2 ($k=2$). The proportion of cases with ties is 287/400=.7175, and the test statistic is 13.460, df=1, with significance p=.00024. But the "Dunn's" comparison computed following the SPSS formulas will be
    Sample1  Sample2  MeanRank1  MeanRank2  TestStat  StError  Z       Sig2side  AdjSig
    V1       V2       1.54875    1.45125    .0975     .0500    1.9500  .05118    .05118
Nonsignificant. Why? No proper (Friedman style) adjustment for ties was done.
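The two results can be checked from the summary figures alone. The sketch below plugs the reported mean ranks, $n=400$ and 287 tied cases into the Friedman formula given above and into the Dunn-style formula. It assumes $T=t^3-t$ for each set of $t$ tied values (so each tied case contributes 6 when $k=2$); the function name is made up for the example.

```python
import math

def friedman_chi2(rank_sums, n, tie_sum):
    """Tie-corrected Friedman chi-square from the rank sums C of each variable."""
    k = len(rank_sums)
    numer = 12 / (n * k * (k + 1)) * sum(c * c for c in rank_sums) - 3 * n * (k + 1)
    denom = 1 - tie_sum / (n * k * (k * k - 1))
    return numer / denom

n = 400
mean_ranks = [1.54875, 1.45125]           # from the table above
rank_sums = [n * m for m in mean_ranks]   # C = n * mean rank
tie_sum = 287 * (2**3 - 2)                # assumption: T = t^3 - t, t = 2 per tied case

chi2 = friedman_chi2(rank_sums, n, tie_sum)
p_friedman = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with df = 1

k = 2
z = (mean_ranks[0] - mean_ranks[1]) / math.sqrt(k * (k + 1) / (6 * n))
p_dunn = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value

print(round(chi2, 3), round(p_friedman, 5))  # Friedman
print(round(z, 4), round(p_dunn, 5))         # Dunn-style comparison
print(z * z / (1 - 287 / 400))               # Z^2 divided by the tie correction
```

For $k=2$ the Friedman statistic is exactly $Z^2$ divided by $1-.7175$, so the whole gap between the two tests here comes from the missing tie correction.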
When there are only $k=2$ samples in the data, a correct post hoc pairwise comparison test must give the same result (statistic and p-value) as the omnibus test. This is actually a property which proves that the post hoc test corresponds (is isomorphic) to the parent omnibus test. It is indeed so with the Kruskal-Wallis test and Dunn's test: just program them following SPSS Algorithms and test with V1 and V2 as two independent groups, and you'll get the same p=.0153 both for KW and for Dunn. But we saw that a similar equivalence is absent between the Friedman test and the "Dunn's approach" post-Friedman comparison test.
Conclusion. The post hoc multiple comparison test performed by SPSS (version 22 and earlier) after Friedman's test is defective. Maybe it is correct when there are no ties, but I don't know. The post hoc test does not treat ties the way Friedman's test does (while it should). I cannot say anything about the standard error formula they are using, sqrt[k*(k+1)/(6n)]: it was derived from the discrete uniform distribution, but they didn't write how; is it correct? Either the "Dunn's test approach" was adapted to Friedman inadequately by SPSS, or Dunn's test cannot be adapted to Friedman at all.
Best Answer
The short answer is that needless multiple pairwise testing will taint your inference. That is, if you conduct multiple pairwise comparisons, the probability of falsely rejecting the null hypothesis in at least one of these tests increases as the number of pairwise comparisons increases.
This is the multiple testing problem that is typically introduced in a statistics class to motivate analysis of variance. Notice that a pairwise testing procedure assumes that each test is independent of the others. This assumption cannot be true since, in comparing 3 groups, the same groups are used repeatedly across tests.
Often the Bonferroni method is discussed as a means of controlling this inflation of the type 1 error rate: each observed p-value is multiplied by the number of tests conducted (equivalently, the desired significance level is divided by that number).
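A small sketch of both ideas follows. The familywise formula assumes independent tests, which (as noted above) does not hold for pairwise comparisons of the same groups, so it is only an illustration of how fast the error rate can grow; the Bonferroni adjustment itself does not need independence. The function names are hypothetical, and apart from p=.066 the example p-values are made up.

```python
def familywise_error_if_independent(alpha, m):
    """P(at least one false rejection) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Three pairwise comparisons among 3 groups, each tested at alpha = .05.
print(familywise_error_if_independent(0.05, 3))

# Illustrative p-values: .066 is the uncorrected Dunn-style p from the example
# data discussed earlier; the other two are invented.
print(bonferroni_adjust([0.066, 0.5, 0.01]))
```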
If you first conduct a test that is designed to compare multiple parameter estimates, the correct inference can be made (as long as the assumptions for that test hold). If this test rejects the null hypothesis of equality of parameters, then post hoc methods can be employed to determine which parameter(s) differ.