Solved – Statistical significance when comparing two models for classification

classification, machine learning, statistical significance

I have been reading many deep learning papers. In some of them, I see the term "statistical significance" when the authors compare the results predicted by different models on a given dataset.

So, suppose you have two classifiers A and B. You use these models to classify a dataset with 1000 samples and get the accuracies X and Y for A and B respectively.

Could you give examples of when one model is, or isn't, better than the other with statistical significance?

I know that this question has to do with null hypotheses, p-values, and related topics. However, I can't figure out how to relate those concepts to a dataset and to models predicting labels from it.

Best Answer

Simply speaking, the performance metrics we use are statistics derived from our test set. We can therefore compute confidence intervals around these statistics just as we would in a classical setting.

For example, let's say we use accuracy (which is not a good metric for classification), i.e. the proportion of correctly classified items in our test set. We can treat this statistic as coming from a binomial distribution and ask about its corresponding binomial proportion confidence interval. Let's say that we have $N=100$ test points and classifier $C_1$ classified $80$ items correctly while classifier $C_2$ classified $83$ items correctly. The Wilson confidence interval for a type I error probability $\alpha = 0.05$ would be $[0.711, 0.866]$ for classifier $C_1$ and $[0.744, 0.891]$ for $C_2$. Usual hypothesis-testing reasoning would suggest that $C_1$ and $C_2$ do not have substantially different performance in terms of accuracy.

What if we had $N = 10000$ and classifier $C_1$ classified $8000$ items correctly while classifier $C_2$ classified $8300$ items correctly? The confidence intervals would be $[0.792, 0.807]$ and $[0.822, 0.837]$ for classifiers $C_1$ and $C_2$ respectively. This would suggest that $C_1$ and $C_2$ do have different performance on this test set.
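As a minimal sketch of how the intervals above could be computed (assuming Python with `statsmodels` installed; the counts mirror the example and are otherwise arbitrary):

```python
# Sketch: Wilson binomial-proportion confidence intervals for test-set accuracy.
# Uses statsmodels' proportion_confint; counts follow the example above.
from statsmodels.stats.proportion import proportion_confint

settings = [
    ("C1, N=100",   80,   100),
    ("C2, N=100",   83,   100),
    ("C1, N=10000", 8000, 10000),
    ("C2, N=10000", 8300, 10000),
]

for name, n_correct, n_total in settings:
    low, high = proportion_confint(n_correct, n_total, alpha=0.05, method="wilson")
    print(f"{name}: accuracy = {n_correct / n_total:.3f}, "
          f"95% Wilson CI = [{low:.3f}, {high:.3f}]")
```

Running this should roughly reproduce the intervals quoted above (up to rounding).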

Notice that I simply used a parametric approximation to get the CIs for accuracy. I would strongly suggest using bootstrapping to get a non-parametric estimate of the distribution of the metric of interest; you can then use a paired-sample hypothesis test.
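A minimal sketch of the paired bootstrap idea, assuming Python/NumPy and two hypothetical 0/1 arrays (`correct_a`, `correct_b`) marking which items of the same test set each classifier got right:

```python
# Sketch: paired bootstrap CI for the accuracy difference of two classifiers
# evaluated on the same test set. correct_a / correct_b are hypothetical
# 0/1 correctness indicators, one entry per test item.
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(correct_a, correct_b, n_boot=10_000, alpha=0.05):
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    n = len(correct_a)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample the same test items for both models
        diffs[b] = correct_a[idx].mean() - correct_b[idx].mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# Illustrative usage with simulated indicators (in practice these come from your models):
correct_a = rng.random(1000) < 0.80
correct_b = rng.random(1000) < 0.83
print(paired_bootstrap_ci(correct_a, correct_b))    # a CI excluding 0 suggests a real difference
```

Resampling the same indices for both models preserves the pairing, which is what makes this more powerful than comparing two independent confidence intervals.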

For more details, I would suggest looking at some classic references such as "Approximate statistical tests for comparing supervised classification learning algorithms" by Dietterich or "Statistical comparisons of classifiers over multiple data sets" by Demšar; they explicitly look into paired $t$-tests and ANOVA approaches. I also found Derrac et al.'s "A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms" quite nice to follow (and more generally applicable than its title would suggest).
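For completeness, here is a hedged sketch of the kind of paired test those references discuss: a paired $t$-test on per-fold accuracies from a common cross-validation split (the fold accuracies below are made-up placeholders, not results from the example above):

```python
# Sketch: paired t-test on per-fold accuracies of two classifiers evaluated
# on the same cross-validation folds (numbers are illustrative placeholders).
from scipy.stats import ttest_rel

acc_c1 = [0.80, 0.78, 0.82, 0.79, 0.81]   # classifier C1, one accuracy per fold
acc_c2 = [0.83, 0.82, 0.84, 0.81, 0.85]   # classifier C2, same folds

t_stat, p_value = ttest_rel(acc_c1, acc_c2)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # a small p-value suggests a significant difference
```

Keep in mind that Dietterich's paper discusses the pitfalls of naively applying such resampled $t$-tests (correlated folds can inflate the Type I error), so treat this only as an illustration of the mechanics.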
