Solved – Appropriate non-parametric post-hoc test for baseline comparisons

bonferroni, machine-learning, post-hoc, wilcoxon-signed-rank

I want to evaluate several "classifiers" (machine-learning algorithms) with paired samples. I do not want to compare each algorithm's performance to every other algorithm's (an n × n comparison), but only to one baseline (an n × 1 comparison).

An often quoted paper in the field [1] uses the Friedman test for omnibus testing and suggests the following for post-hoc tests:

When all classifiers are compared with a control classifier, we can instead of the Nemenyi test use one of the general procedures for controlling the family-wise error in multiple hypothesis testing, such as the Bonferroni correction or similar procedures.

Can I thus use any test for comparing two groups with paired samples and apply a Bonferroni (or less conservative) correction to the p-values? Is a Wilcoxon signed-rank test appropriate here?

[1] Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
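For concreteness, the setup I have in mind looks like the following sketch: one paired Wilcoxon signed-rank test per algorithm against the baseline, with a Bonferroni correction applied afterwards. The scores, algorithm names, and effect sizes below are made up for illustration.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical example: each entry is a paired sample (e.g. per-dataset
# accuracy), one array for the baseline and one per competing algorithm.
rng = np.random.default_rng(0)
baseline = rng.uniform(0.70, 0.90, size=20)
algorithms = {
    "algo_A": baseline + rng.normal(0.03, 0.02, size=20),
    "algo_B": baseline + rng.normal(0.01, 0.02, size=20),
    "algo_C": baseline + rng.normal(-0.01, 0.02, size=20),
}

# One paired Wilcoxon signed-rank test per algorithm vs. the baseline.
raw_p = {name: wilcoxon(scores, baseline).pvalue
         for name, scores in algorithms.items()}

# Bonferroni: multiply each p-value by the number of comparisons (cap at 1).
k = len(raw_p)
for name, p in raw_p.items():
    print(f"{name}: raw p = {p:.4f}, Bonferroni p = {min(1.0, p * k):.4f}")
```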

Best Answer

You can use the Friedman test (or its Iman-Davenport correction) to test the ranking of the methods. Then, for the post-hoc step in a 1 × n comparison, use a procedure that compares each method with the control.
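As a sketch of that first step (the score matrix below is hypothetical): the Iman-Davenport statistic follows Demšar's formula F_F = (N−1)χ²_F / (N(k−1) − χ²_F), which is F-distributed with (k−1, (k−1)(N−1)) degrees of freedom, where N is the number of datasets and k the number of algorithms.

```python
import numpy as np
from scipy.stats import friedmanchisquare, f

# Hypothetical scores: rows = N datasets, columns = k algorithms.
scores = np.array([
    [0.80, 0.85, 0.82, 0.79],
    [0.75, 0.80, 0.78, 0.74],
    [0.90, 0.92, 0.91, 0.88],
    [0.65, 0.70, 0.68, 0.66],
    [0.85, 0.88, 0.86, 0.84],
])
N, k = scores.shape

# Friedman omnibus test on the per-dataset ranks.
chi2, p_friedman = friedmanchisquare(*scores.T)

# Iman-Davenport correction: F-distributed with (k-1, (k-1)(N-1)) d.f.
ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
p_iman = f.sf(ff, k - 1, (k - 1) * (N - 1))
print(f"Friedman chi2 = {chi2:.3f} (p = {p_friedman:.4f})")
print(f"Iman-Davenport F = {ff:.3f} (p = {p_iman:.4f})")
```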

Nemenyi is valid but not recommended, because it is a very conservative procedure and many obvious differences may not be detected. It is better to use more powerful procedures, such as Holm or Hochberg. The most powerful methods are Li and Finner [1].
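A sketch of the adjustment step, assuming the raw p-values from the control comparisons are already computed: Holm and Hochberg are available in statsmodels; Finner is not, so it is implemented below from the adjusted-p-value formula given by García et al. [1].

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from the pairwise tests against the control.
raw_p = np.array([0.001, 0.020, 0.045, 0.300])

# Holm (step-down) and Hochberg (step-up) adjustments; both are more
# powerful than Bonferroni while still controlling the family-wise error.
for method in ("holm", "simes-hochberg"):
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(method, np.round(adj_p, 4), reject)

def finner_adjust(p):
    """Finner adjusted p-values: APV_i = max_{j<=i} {1 - (1 - p_(j))^(k/j)}."""
    order = np.argsort(p)
    k = len(p)
    adj = 1 - (1 - p[order]) ** (k / np.arange(1, k + 1))
    adj = np.minimum(np.maximum.accumulate(adj), 1.0)  # enforce monotonicity
    out = np.empty_like(adj)
    out[order] = adj
    return out

print("finner", np.round(finner_adjust(raw_p), 4))
```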

[1] García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180, 2044–2064. doi:10.1016/j.ins.2009.12.010
