Solved – Statistical testing: Multiple classifiers, 1 domain. Would rANOVA be appropriate?

anova, classification, machine learning, t-test

When comparing the performance of two classifiers over a single domain, in the context of a classification problem in machine learning, it is common to use a paired t-test, using the 10 average results from 10×10-fold cross-validation as measurements, where the folds at each iteration are the same for the two classifiers.
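For concreteness, here is a minimal sketch of that paired t-test. The accuracy values are invented for illustration (they are not results from any real experiment), and the use of `scipy.stats.ttest_rel` is my assumption about tooling; each list entry stands for the mean accuracy of one 10-fold run, with both classifiers evaluated on the same folds.

```python
from scipy import stats

# Hypothetical data: mean accuracy of each of the 10 repetitions of
# 10-fold cross-validation. Entry i of both lists comes from the same
# fold partition, which is what makes the measurements paired.
acc_a = [0.842, 0.851, 0.838, 0.847, 0.855, 0.840, 0.849, 0.844, 0.852, 0.846]
acc_b = [0.831, 0.845, 0.829, 0.840, 0.843, 0.835, 0.838, 0.836, 0.841, 0.839]

# Paired (dependent samples) t-test on the per-repetition differences.
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)

print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```

The pairing matters: `ttest_rel` works on the differences `acc_a[i] - acc_b[i]`, so variation between repetitions that affects both classifiers equally cancels out.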

An obvious yet incorrect generalization of this t-test to the case where there are multiple classifiers would be to take all pairs, or just the interesting pairs if we only want to compare one classifier to all the others, and to work from there. The problem, however, is that the results of these multiple t-tests are not independent of one another.

I've heard of the ANOVA as a generalization of the t-test to situations with more than 2 groups to compare. I thought it was the solution to this problem, but then I read more and stumbled upon the repeated measures ANOVA, which seems to be even closer to what I'm looking for.

Could anyone confirm (or invalidate) that a repeated measures ANOVA is indeed what should be used in this situation? What is the difference between a regular ANOVA and the rANOVA in this case?

Best Answer

An ANOVA with repeated measures is used if you want to compare more than 2 group means where the participants are the same in each group.

In your ML scenario, you draw samples from either a single n-fold cross-validation or a repeated one. Thus, the data (-> participants, independent variable) from which the samples (-> performance measures, dependent variable) are drawn are paired, because every classifier is evaluated on the same folds. With two classifiers, you would use a t-test for paired data. With more than two classifiers, a repeated measures ANOVA is appropriate, and you can still use paired t-tests with Bonferroni/Holm/Hochberg adjustments for multiple comparisons in post-hoc testing.
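To illustrate the difference between a regular ANOVA and a repeated measures ANOVA, here is a minimal sketch that computes both on the same synthetic scores. Everything in it is an assumption for illustration: the data are randomly generated, the helper `repeated_measures_anova` is a hand-rolled one-way implementation rather than a library routine, and the sphericity assumption is not checked. Rows play the role of participants (CV repetitions), columns are the classifiers.

```python
import numpy as np
from scipy import stats


def repeated_measures_anova(scores):
    """One-way repeated measures ANOVA (illustrative helper).

    scores: array of shape (n_subjects, k_conditions); here each row is
    one CV repetition and each column one classifier.
    Returns the F statistic and its p-value.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()

    # Variation between classifiers (the effect of interest).
    ss_conditions = n * np.sum((scores.mean(axis=0) - grand_mean) ** 2)
    # Variation between repetitions, removed from the error term.
    ss_subjects = k * np.sum((scores.mean(axis=1) - grand_mean) ** 2)
    ss_total = np.sum((scores - grand_mean) ** 2)
    ss_error = ss_total - ss_conditions - ss_subjects

    df_conditions = k - 1
    df_error = (n - 1) * (k - 1)
    f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
    p_value = stats.f.sf(f_stat, df_conditions, df_error)
    return f_stat, p_value


# Synthetic scores: 10 repetitions x 3 hypothetical classifiers.
# Each repetition shifts all classifiers together (shared fold effect).
rng = np.random.default_rng(0)
base_acc = np.array([0.80, 0.81, 0.83])
rep_effect = rng.normal(0.0, 0.02, size=(10, 1))
noise = rng.normal(0.0, 0.003, size=(10, 3))
scores = base_acc + rep_effect + noise

f_rm, p_rm = repeated_measures_anova(scores)
# Regular one-way ANOVA treats the three columns as independent groups.
f_ow, p_ow = stats.f_oneway(*scores.T)

print(f"repeated measures ANOVA: F = {f_rm:.2f}, p = {p_rm:.3g}")
print(f"regular one-way ANOVA:   F = {f_ow:.2f}, p = {p_ow:.3g}")
```

The regular ANOVA leaves the repetition-to-repetition variation in its error term, while the repeated measures version subtracts it (`ss_subjects`) first. When the folds affect all classifiers in a similar way, as in this synthetic setup, that makes the repeated measures test noticeably more sensitive.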

Please also see "Japkowicz/Shah: Evaluating Learning Algorithms. A Classification Perspective, 2011, p. 240 ff." for the repeated measures ANOVA and the other book sections for more scenarios in this domain (2/>2 classifiers on single/multiple domains) and the suggested parametric / nonparametric tests. Quoting the author from a private email, "The case of multiple classifiers on a single domain is basically a special case of the multiple classifiers on multiple domains and hence in principle the omnibus tests you mention can be used. However, this in practice boils down to multiple comparisons of two classifiers and hence you can directly apply the post-hoc tests. Please do keep in mind though that the confidence level (\alpha) would need to be adapted since there are multiple comparisons being made (in line with the description of the Bonferroni-Dunn test)."
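As a rough sketch of what adapting \alpha looks like in practice: with m pairwise comparisons, the Bonferroni approach compares each p-value against \alpha / m (equivalently, multiplies each p-value by m), and Holm's step-down procedure is a uniformly less conservative alternative. The p-values and pair names below are made up for illustration, and the helper functions `bonferroni` and `holm` are my own, not taken from the book.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted p never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values from paired t-tests on three classifier pairs.
pairs = ["A vs B", "A vs C", "B vs C"]
p_values = [0.010, 0.040, 0.030]
alpha = 0.05

for name, p_bonf, p_holm in zip(pairs, bonferroni(p_values), holm(p_values)):
    print(f"{name}: Bonferroni p = {p_bonf:.3f}, Holm p = {p_holm:.3f}, "
          f"significant (Holm) = {p_holm < alpha}")
```

Comparing an adjusted p-value against the original \alpha is equivalent to comparing the raw p-value against the adapted threshold, so either form can be reported.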