Solved – Statistical significance of multiple classifiers using p-values

accuracy, classification, cross-validation, p-value, statistical significance

I have a classification problem with 675 samples, and I used 7 machine learning algorithms with 10-fold cross-validation for prediction. Let's say the following table shows the accuracy of each algorithm.

My supervisor asked me to perform a p-value test on the results. I have searched and read a bit, but I am not confident that such a test is appropriate. So I wonder: is it possible (and does it make sense) to compute p-values for these data in order to claim the results are statistically significant? Or is such a test irrelevant in this context?

If the answer is yes and a p-value test is appropriate, how can I do it? Is it possible with Python or Excel, etc.?

Many thanks

Accuracies

Best Answer

This is common practice but quite controversial among statisticians (Dietterich 1998; Kohavi 1995).

For a t-test you need 30 or more samples if you cannot assume a normal distribution (which you really cannot here). The common approach is to run 3 repetitions of 10-fold CV. The folds should be shuffled by a random number generator, but the same random folds must be used with each algorithm. Then use a one-sample t-test on the fold-wise differences in performance, as sketched below.
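A minimal sketch in Python, assuming you have kept the per-fold accuracies of two algorithms evaluated on the same shuffled folds (3 x 10-fold CV gives 30 paired scores); the numbers below are random placeholders, not real results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-fold accuracies (replace with your own 3 x 10-fold results);
# both algorithms must be evaluated on the same shuffled folds.
acc_a = rng.normal(0.85, 0.03, size=30)
acc_b = rng.normal(0.83, 0.03, size=30)

diff = acc_a - acc_b
# One-sample t-test of the fold-wise differences against zero
# (equivalent to a paired t-test on acc_a vs. acc_b).
t_stat, p_value = stats.ttest_1samp(diff, popmean=0.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```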

T-tests also assume i.i.d. samples, and the independence assumption does not hold in CV. This problem can be avoided by using non-parametric tests instead. Alternatively, you can use corrected resampled t-tests, which correct precisely for this non-independence in repeated cross-validation settings. They are less powerful than ordinary t-tests, but arguably more powerful than the non-parametric alternatives. Weka uses these corrected tests by default, for example.
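A hedged sketch of such a corrected resampled t-test (a Nadeau-Bengio-style variance correction, which is what Weka's experimenter applies), assuming 3 x 10-fold CV on 675 samples, so each fold trains on roughly 608 samples and tests on roughly 67; the difference values are placeholders:

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(diff, n_train, n_test):
    """Corrected resampled t-test on the fold-wise performance
    differences from repeated cross-validation."""
    n = len(diff)                        # folds x repetitions (e.g. 30)
    mean_d = np.mean(diff)
    var_d = np.var(diff, ddof=1)         # sample variance of the differences
    # Inflate the variance to account for the overlap between training sets
    denom = np.sqrt((1.0 / n + n_test / n_train) * var_d)
    t_stat = mean_d / denom
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-sided p-value
    return t_stat, p_value

# Placeholder fold-wise differences; replace with acc_a - acc_b from your CV runs.
rng = np.random.default_rng(1)
diff = rng.normal(0.02, 0.03, size=30)
print(corrected_resampled_ttest(diff, n_train=608, n_test=67))
```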

PS. If you compare 7 algorithms pairwise, you also need a Bonferroni correction (or something similar) applied to your p-values.
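A minimal sketch of the Bonferroni adjustment: with 7 algorithms there are 7*6/2 = 21 pairwise comparisons, so each raw p-value is multiplied by 21 and capped at 1 (the raw p-values below are hypothetical):

```python
import numpy as np

raw_p = np.array([0.003, 0.04, 0.20])   # hypothetical raw pairwise p-values
n_comparisons = 7 * 6 // 2              # 21 pairwise comparisons of 7 algorithms
adjusted_p = np.minimum(raw_p * n_comparisons, 1.0)
print(adjusted_p)
# statsmodels.stats.multitest.multipletests offers this and less conservative
# alternatives (e.g. Holm) if you prefer a library routine.
```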
