Friedman Test vs Wilcoxon Test – Comparative Analysis in Hypothesis Testing

Tags: hypothesis-testing, nonparametric, r

I'm trying to assess the performance of a supervised machine learning classification algorithm. The observations fall into nominal classes (two for the time being, though I'd like to generalize this to multi-class problems), drawn from a population of 99 subjects.

One of the questions I'd like to be able to answer is whether the algorithm exhibits a significant difference in classification accuracy between the input classes. For the binary classification case I am comparing mean accuracy between the classes across subjects using a paired Wilcoxon test (since the underlying distribution is non-normal). In order to generalize this procedure to multi-class problems I intended to use a Friedman test.

However, the p-values obtained by these two procedures in the case of a binary IV differ wildly, with the Wilcoxon test yielding p < .001 whereas the Friedman test yields p = .25. This leads me to believe I have a fundamental misunderstanding of the structure of the Friedman test.

Is it not appropriate to use a Friedman test in this case to compare the repeated accuracy measures across all subjects?

My R code to obtain these results (subject is the subject identifier, acc the accuracy DV, and expected the observation-class IV):

> head(subject.accuracy, n=10)
   subject expected        acc
1       10     none 0.97826087
2       10     high 0.55319149
3      101     none 1.00000000
4      101     high 0.68085106
5      103     none 0.97826087
6      103     high 1.00000000
7      104     none 1.00000000
8      104     high 0.08510638
9      105     none 0.95121951
10     105     high 1.00000000
> ddply(subject.accuracy, .(expected), summarise, mean.acc = mean(acc), se.acc = sd(acc)/sqrt(length(acc)))
  expected  mean.acc     se.acc
1     none 0.9750619 0.00317064
2     high 0.7571259 0.03491149
> wilcox.test(acc ~ expected, subject.accuracy, paired=T)

    Wilcoxon signed rank test with continuity correction

data:  acc by expected
V = 3125.5, p-value = 0.0003101
alternative hypothesis: true location shift is not equal to 0

> friedman.test(acc ~ expected | subject, subject.accuracy)

    Friedman rank sum test

data:  acc and expected and subject
Friedman chi-squared = 1.3011, df = 1, p-value = 0.254

Best Answer

The Friedman test is not an extension of the Wilcoxon test, so when you have only 2 related samples it is not the same as the Wilcoxon signed-rank test. The latter accounts for the magnitude of the difference within a case (and then ranks those differences across cases), whereas Friedman only ranks within a case (and never across cases): it is less sensitive.
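
A small sketch with simulated paired data (not the asker's data) illustrating that difference: with k = 2 conditions the within-subject Friedman ranks reduce to the sign of the difference, while the Wilcoxon V statistic also ranks the absolute differences across subjects.

# Hypothetical paired data: two conditions measured on the same subjects
set.seed(1)
n <- 20
a <- rnorm(n)
b <- a + rnorm(n, mean = 0.3)
d <- b - a

# Friedman with k = 2: within each subject only the rank order of the two
# values matters, i.e. only the sign of d survives
within_ranks <- t(apply(cbind(a, b), 1, rank))
head(within_ranks)

# Wilcoxon signed-rank: ranks |d| across subjects and sums the ranks of the
# positive differences, so the magnitude of d carries weight
V <- sum(rank(abs(d))[d > 0])
V
wilcox.test(b, a, paired = TRUE)$statistic  # same V here (no ties or zeros)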

Friedman is actually almost an extension of the sign test. With 2 samples their p-values are very close, Friedman being just slightly more conservative (the two tests treat ties in somewhat different ways). This small difference quickly vanishes as the sample size grows. So, for two related samples, these two tests are essentially peer alternatives.
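
To see this in R, one can run both tests on the same simulated paired data (hypothetical data, not the asker's); the Friedman p-value and the exact sign-test p-value should come out close, per the point above.

# Hypothetical paired data in long format
set.seed(2)
n <- 30
a <- rnorm(n)
b <- a + rnorm(n, mean = 0.25)
dat <- data.frame(
  subject = factor(rep(seq_len(n), 2)),
  cond    = factor(rep(c("a", "b"), each = n)),
  y       = c(a, b)
)

# Friedman test with two related samples
friedman.test(y ~ cond | subject, data = dat)

# Sign test: exact binomial test on the number of positive differences
d <- b - a
binom.test(sum(d > 0), sum(d != 0))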

The test that relates to Wilcoxon in the same way Friedman relates to the sign test is the not very well known Quade test, mentioned for example here: http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/friedman.htm.
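
For reference, quade.test() is available in base R's stats package and takes the same formula interface as friedman.test(). A sketch on simulated data shaped like the asker's subject.accuracy (the data here are made up for illustration):

# Quade test: like Friedman, but blocks (subjects) are weighted by their
# within-block range, so magnitudes matter, much as in the Wilcoxon test
set.seed(3)
n <- 25
dat <- data.frame(
  subject  = factor(rep(seq_len(n), 2)),
  expected = factor(rep(c("none", "high"), each = n)),
  acc      = c(rbeta(n, 20, 1), rbeta(n, 5, 2))
)
quade.test(acc ~ expected | subject, data = dat)

# On the asker's own data frame this would read:
# quade.test(acc ~ expected | subject, data = subject.accuracy)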
