Solved – How to Compare Two Algorithms with Multiple Datasets and Multiple Runs

anova, hypothesis testing, machine learning, repeated measures, statistical significance

I have two computational methods (A and B), each of which behaves randomly, i.e., if you run the same method 10 times, you get 10 different results (usually with a small variance). To compare the two methods, we selected 5 different databases (it is hard to get more) and ran method A and method B 10 times each on each of the five databases. This resulted in a 10x5 matrix of measurements (a row for each run and a column for each database) for each method. All measurements are paired between the two methods, because we can control the seed for each run and the database can be reused for both methods, i.e., $\text{run}_i$ on $\text{database}_j$ uses the same $\text{seed}_i$ for both methods.

Example (the values in the tables are the accuracies of the methods):

Method A

+-------------+--------+--------+--------+--------+--------+
| Run/Database|   1    |   2    |   3    |   4    |   5    |
+-------------+--------+--------+--------+--------+--------+
|           1 | 88.92% | 44.60% | 69.49% | 73.37% | 85.63% |
|           2 | 89.00% | 42.72% | 64.10% | 71.94% | 85.92% |
|           3 | 88.35% | 45.07% | 65.13% | 72.14% | 85.78% |
|           4 | 88.92% | 43.66% | 67.95% | 72.76% | 85.28% |
|           5 | 87.94% | 50.23% | 67.18% | 71.94% | 85.92% |
|           6 | 87.78% | 43.19% | 68.72% | 73.47% | 86.27% |
|           7 | 89.08% | 45.54% | 66.41% | 71.33% | 85.56% |
|           8 | 88.83% | 42.72% | 66.15% | 72.45% | 86.77% |
|           9 | 88.43% | 45.07% | 68.97% | 72.45% | 86.49% |
|          10 | 88.59% | 40.38% | 66.15% | 73.67% | 86.13% |
+-------------+--------+--------+--------+--------+--------+

Method B

+-------------+--------+--------+--------+--------+--------+
| Run/Database|   1    |   2    |   3    |   4    |   5    |
+-------------+--------+--------+--------+--------+--------+
|           1 | 22.73% | 53.99% | 59.74% | 65.20% | 79.59% |
|           2 | 75.97% | 46.95% | 58.46% | 71.63% | 84.42% |
|           3 | 76.94% | 53.05% | 58.97% | 68.37% | 85.06% |
|           4 | 76.54% | 42.25% | 46.67% | 68.67% | 85.92% |
|           5 | 46.60% | 52.11% | 52.82% | 68.98% | 85.14% |
|           6 | 76.78% | 48.83% | 55.90% | 68.27% | 78.38% |
|           7 | 79.37% | 47.89% | 58.72% | 71.12% | 85.06% |
|           8 | 77.83% | 54.93% | 50.77% | 72.14% | 87.06% |
|           9 | 83.01% | 46.95% | 56.15% | 67.96% | 84.92% |
|          10 | 78.24% | 49.30% | 58.21% | 67.96% | 81.29% |
+-------------+--------+--------+--------+--------+--------+
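For concreteness, here is a minimal sketch (in Python) of how such a paired design can be produced. The functions `run_method_a` and `run_method_b` are hypothetical placeholders for the two methods, and the accuracies they return are made-up values, not the code or data behind the tables above:

```python
import numpy as np

# Hypothetical stand-ins for the two stochastic methods; they only return a
# noisy placeholder accuracy so that the sketch runs end to end.
def run_method_a(db_index, seed):
    rng = np.random.default_rng(seed)
    return 70.0 + 3.0 * db_index + rng.normal(scale=1.0)

def run_method_b(db_index, seed):
    rng = np.random.default_rng(seed)
    return 65.0 + 3.0 * db_index + rng.normal(scale=3.0)

n_runs = 10
databases = ["db1", "db2", "db3", "db4", "db5"]
acc_a = np.empty((n_runs, len(databases)))
acc_b = np.empty((n_runs, len(databases)))

for j, db in enumerate(databases):
    for i in range(n_runs):
        seed = i                      # same seed_i for both methods -> paired
        acc_a[i, j] = run_method_a(j, seed)
        acc_b[i, j] = run_method_b(j, seed)

# acc_a and acc_b now have the same 10x5 layout as the tables above:
# one row per run, one column per database, paired through the shared seeds.
```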

Which statistical method should I use to find out which method performs best overall? Or to determine whether method A is statistically different from method B in terms of average accuracy?

I looked into Student's t-test and one- and two-way repeated-measures ANOVA, but they did not seem appropriate for this analysis. Any suggestion of a valid statistical analysis is appreciated.

Best Answer

There is a paper that studies exactly this question in detail (Statistical Comparisons of Classifiers over Multiple Data Sets, Demšar 2006), with rather sobering conclusions.

It is actually very tricky. As you note, your methods are not deterministic and yield a different result each time. That means that in a particular experiment A might be better than B by pure chance, even though, if you repeated the experiment many times, B would be better than A on average.

Regardless of the random behaviour of your methods, when you test two methods on different databases, sometimes A will be better than B and vice versa simply by chance; there is no universally better algorithm. Another way to look at it is that it makes little sense to directly compare, or average, results obtained with the same methods on different data sets if those results are not commensurate.

The basic takeaway message of the paper is that, if you cannot guarantee that the assumptions made by parametric tests (be it ANOVA or the t-test) are fulfilled, then it is better to use non-parametric tests (the Wilcoxon signed-rank test or the Friedman test), and that does indeed seem to be your case. See also these slides, especially slide 34, for a very nice summary.
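As an illustration, a Wilcoxon signed-rank test over the per-database mean accuracies could be carried out with SciPy along the lines below. This is only a sketch, not a full analysis: the values are copied from the tables in the question, the 10 runs are first averaged per database so that the test is paired over the 5 databases, and with only 5 databases the test has very little power.

```python
import numpy as np
from scipy import stats

# Accuracies (%) from the question's tables: rows = runs, columns = databases.
method_a = np.array([
    [88.92, 44.60, 69.49, 73.37, 85.63],
    [89.00, 42.72, 64.10, 71.94, 85.92],
    [88.35, 45.07, 65.13, 72.14, 85.78],
    [88.92, 43.66, 67.95, 72.76, 85.28],
    [87.94, 50.23, 67.18, 71.94, 85.92],
    [87.78, 43.19, 68.72, 73.47, 86.27],
    [89.08, 45.54, 66.41, 71.33, 85.56],
    [88.83, 42.72, 66.15, 72.45, 86.77],
    [88.43, 45.07, 68.97, 72.45, 86.49],
    [88.59, 40.38, 66.15, 73.67, 86.13],
])
method_b = np.array([
    [22.73, 53.99, 59.74, 65.20, 79.59],
    [75.97, 46.95, 58.46, 71.63, 84.42],
    [76.94, 53.05, 58.97, 68.37, 85.06],
    [76.54, 42.25, 46.67, 68.67, 85.92],
    [46.60, 52.11, 52.82, 68.98, 85.14],
    [76.78, 48.83, 55.90, 68.27, 78.38],
    [79.37, 47.89, 58.72, 71.12, 85.06],
    [77.83, 54.93, 50.77, 72.14, 87.06],
    [83.01, 46.95, 56.15, 67.96, 84.92],
    [78.24, 49.30, 58.21, 67.96, 81.29],
])

# Collapse the 10 runs into one score per database, so that the comparison
# is paired over the 5 databases (one difference per database).
mean_a = method_a.mean(axis=0)
mean_b = method_b.mean(axis=0)

# Non-parametric, paired comparison of two methods over multiple data sets.
# With only 5 databases the smallest achievable two-sided p-value is
# 2 / 2**5 = 0.0625, so do not expect significance at the 5% level.
stat, p_value = stats.wilcoxon(mean_a, mean_b)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p_value:.3f}")

# For three or more methods, scipy.stats.friedmanchisquare would be the
# analogous omnibus test (it requires at least three groups).
```

Averaging the runs first is a deliberate choice: it gives one score per data set, which is the per-data-set paired comparison the test expects, at the cost of discarding the run-to-run variance.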
