Solved – chance-level accuracy in unbalanced classification problems

binary-data, classification, cross-validation, svm, unbalanced-classes

Suppose one has a balanced classification problem (50% of 0's and 50% of 1's). In such a case, the so-called chance-level accuracy of a classifier would be 50%.

What is the chance-level accuracy if the problem is an unbalanced one (e.g. 25% of 0's and 75% of 1's)? Is it still 50%? If one guessed every instance to be 1, one would achieve 75% accuracy. However, assigning labels uniformly at random would still (?) give 50% correct on average.
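Here is a quick NumPy sketch I used to sanity-check both guessing strategies (the 25/75 split matches the example above; the numbers are just illustrative):

```python
# Sanity check of both guessing strategies on a 25%/75% split.
import numpy as np

rng = np.random.default_rng(0)
y = rng.random(100_000) < 0.75           # true labels: ~75% are 1

always_one = np.ones_like(y)             # strategy 1: always guess 1
coin_flip = rng.random(y.size) < 0.5     # strategy 2: guess 1 half the time

print("always 1 :", (always_one == y).mean())   # ~0.75
print("coin flip:", (coin_flip == y).mean())    # ~0.50
```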

I'm using SVMs for the classification and 10-fold cross-validation for performance estimation, if that matters.

Best Answer

The performance of a random classifier depends on how often it predicts positive, i.e. on $P(\hat{y} = 1)$. A random model essentially means a model whose predictions $\hat{y}$ are independent of the true label $y$, which means:

$$ P(\hat{y} = 1\ |\ y = 1) = P(\hat{y} = 1), $$

and

$$ P(y = 1\ |\ \hat{y} = 1) = P(y = 1). $$

The probability of being right, that is the expected accuracy, is then:

$$ P(\hat{y} = y) = P(\hat{y} = 1) P(y = 1) + P(\hat{y} = 0) P(y = 0). $$

Writing $p = P(\hat{y} = 1)$ and $q = P(y = 1)$, this equals $pq + (1 - p)(1 - q)$, which is linear in $p$, so it is maximized at an endpoint: $p = 1$ when $q > 0.5$ and $p = 0$ when $q < 0.5$. Hence, if the dataset is imbalanced, the 'random' model with the best expected accuracy is the one that always predicts the majority class, and its expected accuracy equals the fraction of data in the majority class.
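As a numeric illustration of the formula (a sketch assuming $q = 0.75$, the question's example split):

```python
# Expected accuracy of a label-independent classifier as a function of
# p = P(yhat = 1), for an assumed class balance q = P(y = 1) = 0.75.
q = 0.75
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    acc = p * q + (1 - p) * (1 - q)      # P(yhat = y) from the formula above
    print(f"P(yhat=1) = {p:.2f} -> expected accuracy = {acc:.3f}")
# Linear in p, so maximized at p = 1 (always predict the majority class),
# where the expected accuracy equals q = 0.75.
```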

The main issue with highly imbalanced datasets (say 99% negative) is that you are likely to end up with a trivial model as described above, that is, a model which always predicts the majority class (negative) and achieves high accuracy (99%), so this useless model actually looks good. If you use a poor scoring function (such as accuracy) when optimizing hyperparameters, you are quite likely to end up with a very bad model in imbalanced settings.
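To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the synthetic dataset and 99/1 split are illustrative) where a model that always predicts the majority class already cross-validates at roughly 0.99 accuracy:

```python
# A trivial majority-class model looks excellent under accuracy alone.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic dataset with ~99% negatives (illustrative split).
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

trivial = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(trivial, X, y, cv=10, scoring="accuracy")
print("mean accuracy:", scores.mean())   # ~0.99, yet the model is useless
```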

This is one of the many reasons why discrete measures like accuracy should be avoided. You won't have such issues with measures like the area under the ROC or precision-recall (PR) curve.
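Continuing the same illustrative setup, ROC AUC immediately flags such a model as chance level, since a constant-score classifier traces the diagonal of the ROC curve:

```python
# The same majority-class behaviour scored with ROC AUC sits at 0.5.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)   # ~99% negatives (illustrative)

constant_scores = np.zeros(y.size)            # "always negative" model
print(roc_auc_score(y, constant_scores))      # 0.5 -> exposed as useless
```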