Solved – Testing for classification significance

classification, statistical significance

I have a two-class problem at hand: 160 data samples which were classified with a linear support vector machine. The obtained classification accuracy (test accuracy) is 71% (the average over 70 folds).
I now want to calculate the p-value for this result, i.e. the probability that this result is purely due to chance. However, I have not found a clear (and, for my level, understandable) description of how to do that, and I am not sure whether I need more information about the dataset to perform such a test.
Any help appreciated

Best Answer

It is very unusual to perform a significance test on a classifier. (It is also very unusual to use 70-fold cross-validation on a dataset of 160 samples; 5 or 10 folds are the most common choices. With that many folds you could just as well have chosen a leave-one-out procedure.)

The issue is the null hypothesis. You probably want to know whether your classifier is significantly better than a random classifier, one that did not really learn anything from the data.

Let us assume that the dataset is binary (only two classes, + and -), where p+ is the proportion of positive examples and p- = 1 - p+ is the proportion of negative ones. Consider a classifier that randomly answers + with 50% probability. The chance that a data point is + is p+. Since the classifier's output is independent of the data point itself, the probability that it will be correct on a + prediction (it guesses + and the point really is +) is 0.5*p+. Similarly, the probability of being right on a - prediction is 0.5*p-.

If p+ is 0.5, then the classifier will be right 0.5*0.5 + 0.5*0.5 = 0.5 of the time. That is the null hypothesis for the situation where p+ = 0.5.

But if p+ = 0.9, a classifier that guesses + with 0.5 probability will still have a

0.5*0.9 + 0.5*0.1 = 0.5

probability of being right. But a "smarter" random classifier, one that guesses + with 0.9 probability, will be right with probability

0.9*0.9 + 0.1*0.1 = 0.82

which is the chance-level accuracy of a random classifier whose guesses follow the class proportions. (A degenerate classifier that always answers + would score even higher, 0.9, but matching the class proportions is the usual chance baseline.)
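To make this concrete, here is a minimal simulation sketch (the variable names and sample size are made up for illustration): it draws labels with p+ = 0.9 and a random classifier that guesses + with the same probability, and the empirical accuracy comes out near 0.82.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000   # number of simulated data points (arbitrary, just for the estimate)
    p_pos = 0.9   # proportion of + labels, as in the example above

    # True labels: + with probability p_pos
    y_true = rng.random(n) < p_pos
    # Random classifier that guesses + with the same probability,
    # independently of the data
    y_guess = rng.random(n) < p_pos

    print((y_true == y_guess).mean())   # close to 0.9*0.9 + 0.1*0.1 = 0.82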

Thus, the null hypothesis for a dataset with a p+ proportion of positives is an accuracy of

acc_null = p+^2 + p-^2

So you need to collect p+ and p- from your dataset and compute acc_null.
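As a sketch of that computation in Python (assuming the labels are coded as 1 for + and 0 for -; adapt to your own encoding):

    import numpy as np

    def null_accuracy(y):
        # Chance-level accuracy p+^2 + p-^2 of a random classifier
        # whose guesses follow the class proportions of y
        p_pos = np.mean(np.asarray(y) == 1)
        p_neg = 1.0 - p_pos
        return p_pos**2 + p_neg**2

    print(null_accuracy([1, 0] * 80))   # a balanced 160-sample dataset gives 0.5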

The question now is whether your 71% accuracy is significantly different from acc_null. That can only be answered if you know the number of times your classifier was right, and you do: of the 160 data points, the classifier was correct 0.71*160 = 113.6, i.e. about 114 times.

Thus you need a binomial test to figure out the probability that a random process which generates a "correct" (a 1, a "success") with probability acc_null would have produced at least 114 successes in 160 tries. This is the p-value you are looking for.
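In Python this test is available in SciPy; here is a sketch (the acc_null of 0.5 below is a placeholder for a balanced dataset, plug in the value you computed from your class proportions):

    from scipy.stats import binomtest   # SciPy >= 1.7

    n_correct = 114    # 0.71 * 160, rounded
    n_total = 160
    acc_null = 0.5     # placeholder: use p+^2 + p-^2 from your dataset

    # One-sided test: probability of at least n_correct successes
    # when the per-trial success probability is acc_null
    result = binomtest(n_correct, n_total, acc_null, alternative="greater")
    print(result.pvalue)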
