Solved – AUC and class imbalance in training/test dataset

auc, model-evaluation, roc

I have just started learning about the area under the ROC curve (AUC). I am told that AUC is not affected by class imbalance. I take this to mean that AUC is insensitive to imbalance in the test data, rather than to imbalance in the training data.

In other words, if we only change the ratio of positive to negative classes in the test data, the AUC value may not change much. But if we change the distribution in the training data, the AUC value may change a lot, because the classifier can no longer be learned well. In that case we would have to resort to undersampling or oversampling. Am I right? I just want to make sure my understanding of AUC is correct.
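To make the oversampling remedy concrete, here is a minimal sketch of random oversampling in plain Python. The function name `random_oversample` and the toy data are my own invention, not from any particular library; libraries such as imbalanced-learn provide more sophisticated variants.

```python
import random

def random_oversample(X, y, minority_label=1, seed=0):
    # Duplicate randomly chosen minority-class rows until both
    # classes contain the same number of examples.
    rng = random.Random(seed)
    minority = [(x, lbl) for x, lbl in zip(X, y) if lbl == minority_label]
    majority = [(x, lbl) for x, lbl in zip(X, y) if lbl != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [lbl for _, lbl in balanced]

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]   # toy features
y = [0, 0, 0, 0, 1]                       # 4 negatives, 1 positive
Xb, yb = random_oversample(X, y)
print(yb.count(0), yb.count(1))           # 4 4
```

Only the training set should be resampled like this; the test set must keep its natural distribution so that evaluation remains honest.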

Best Answer

It depends on what you mean by sensitive. The ROC AUC is sensitive to class imbalance in the sense that when there is a minority class, you typically define it as the positive class, and how well that class is predicted has a strong impact on the AUC value. This is very much desirable behaviour. Accuracy, for example, is not sensitive in that way: it can be very high even when the minority class is not well predicted at all.
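The contrast can be demonstrated with a small pure-Python sketch. The `auc` helper below computes AUC as the fraction of (positive, negative) pairs ranked correctly (the Mann-Whitney interpretation); the degenerate "classifier" and the 95/5 split are made up for illustration.

```python
def auc(pos_scores, neg_scores):
    # Probability that a random positive outscores a random negative,
    # counting ties as half (equivalent to the Mann-Whitney U statistic).
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# A degenerate "classifier" that assigns every example the same score,
# i.e. it effectively predicts the majority (negative) class everywhere.
neg_scores = [0.1] * 95   # 95 negatives
pos_scores = [0.1] * 5    # 5 positives

accuracy = 95 / 100       # all negatives right, all positives wrong
print(accuracy)                     # 0.95 — looks impressive
print(auc(pos_scores, neg_scores))  # 0.5 — no better than random guessing
```

Accuracy rewards the model for ignoring the minority class entirely, while the AUC of 0.5 exposes that the scores carry no information about the positive class.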

In most experimental setups (bootstrap or cross-validation, for example) the class distributions of the training and test sets should be similar, but this is a result of how you sample those sets, not of using or not using ROC. You are basically right to say that the ROC abstracts away the class imbalance in the test set by giving equal importance to sensitivity and specificity. When the training set does not contain enough examples to learn a class, though, this will still affect the ROC, as it should.
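The test-set invariance can be checked numerically. In this sketch the score distributions are invented (positives drawn around 0.7, negatives around 0.4, standing in for a hypothetical fixed classifier), and `auc` is the same pairwise-ranking helper as above: because AUC depends only on each class's score distribution, thinning the negatives to change the class ratio leaves it roughly unchanged.

```python
import random

def auc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs ranked correctly, ties count half.
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

random.seed(0)
# Simulated scores from a fixed classifier: positives tend to score higher.
pos = [random.gauss(0.7, 0.15) for _ in range(200)]
neg = [random.gauss(0.4, 0.15) for _ in range(2000)]

auc_full = auc(pos, neg)       # original 1:10 positive:negative test ratio
auc_sub = auc(pos, neg[::10])  # keep every 10th negative -> roughly 1:1 ratio
print(round(auc_full, 3), round(auc_sub, 3))  # the two values agree closely
```

Accuracy at a fixed threshold would shift noticeably under the same resampling, since it weights each class by how many examples it contributes.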

What you do in terms of oversampling and parameter tuning is a separate issue. The ROC can only ever tell you how well a specific configuration works; you can then try multiple configurations and select the best one.
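That selection step might look like the following sketch, where the two configurations and their held-out scores are hypothetical and `auc` is the same pairwise-ranking helper as in the earlier snippets:

```python
def auc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs ranked correctly, ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical held-out scores from two candidate configurations,
# given as (positive-class scores, negative-class scores).
configs = {
    "config_a": ([0.9, 0.8, 0.7], [0.4, 0.3, 0.6]),
    "config_b": ([0.60, 0.50, 0.55], [0.50, 0.45, 0.52]),
}
best = max(configs, key=lambda name: auc(*configs[name]))
print(best)  # config_a
```

In practice you would compare each configuration's AUC on the same validation folds, so the comparison is not confounded by the split.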