Solved – Using ROC curve for balanced data

machine learning, roc, unbalanced-classes

I understand that the area under the ROC curve is a useful performance measure for unbalanced data. What happens if we use it for balanced data?

Best Answer

ROC curves are insensitive to class balance, so they can be used in any setting, balanced or not. Area under the ROC curve is not the same as accuracy. Accuracy is computed from a single contingency table, i.e. at a single classification threshold. Area under the curve summarizes performance across all thresholds, so it tells you more about the classifier than accuracy at any one threshold does.
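To make the difference concrete, here is a minimal sketch in plain Python. The scores are made-up toy values, and `roc_auc` and `accuracy` are hypothetical helper names rather than library functions. It computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, then shows that accuracy changes with the threshold while AUC stays the same when the negatives are replicated ten times.

```python
def roc_auc(labels, scores):
    # AUC as the fraction of (positive, negative) pairs where the positive
    # gets the higher score; ties count as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold):
    # Accuracy of the single contingency table obtained at one threshold.
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (assumed for illustration): 4 negatives, 4 positives.
neg_scores = [0.1, 0.3, 0.4, 0.6]
pos_scores = [0.35, 0.7, 0.8, 0.9]

labels_bal = [0] * len(neg_scores) + [1] * len(pos_scores)
scores_bal = neg_scores + pos_scores

# Same score distributions, but 10x as many negatives.
labels_unbal = [0] * (10 * len(neg_scores)) + [1] * len(pos_scores)
scores_unbal = neg_scores * 10 + pos_scores

auc_bal = roc_auc(labels_bal, scores_bal)
auc_unbal = roc_auc(labels_unbal, scores_unbal)
acc_by_threshold = {t: accuracy(labels_bal, scores_bal, t) for t in (0.2, 0.5, 0.95)}

print(auc_bal, auc_unbal)   # both 0.875
print(acc_by_threshold)     # {0.2: 0.625, 0.5: 0.75, 0.95: 0.5}
```

The pairwise formula is quadratic in the number of samples, which is fine for a toy example; for real data a library routine would be the usual choice.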

The problem with ROC curves is that, for highly unbalanced data, the differences between curves tend to be small (but present!). Precision-recall curves are better in that regard: they can show large differences between classifiers in unbalanced settings where the difference in ROC space seems small. This is shown in figure 4 in this paper, where a difference of 4.5% in ROC space corresponds to 25% in PR space.
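A rough way to see why this happens is to compare two classifiers at a single operating point. The sketch below uses invented numbers, not the figures from the paper: both hypothetical classifiers have a true positive rate of 0.8, and their false positive rates are 0.01 and 0.02. The helper `precision_from_rates` is an illustrative name. In ROC space the two points are 0.01 apart regardless of class balance, but the precision gap grows sharply once negatives outnumber positives 100 to 1.

```python
def precision_from_rates(tpr, fpr, n_pos, n_neg):
    # Turn ROC-space rates into expected counts, then into precision.
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp / (tp + fp)


# Hypothetical operating points: same recall, slightly different FPR.
tpr = 0.8
fpr_a, fpr_b = 0.01, 0.02


def precision_gap(n_pos, n_neg):
    # Difference in precision between classifier A and classifier B.
    prec_a = precision_from_rates(tpr, fpr_a, n_pos, n_neg)
    prec_b = precision_from_rates(tpr, fpr_b, n_pos, n_neg)
    return prec_a - prec_b


gap_balanced = precision_gap(n_pos=1000, n_neg=1000)     # 1:1 classes
gap_unbalanced = precision_gap(n_pos=100, n_neg=10000)   # 1:100 classes

print(f"ROC-space gap (FPR): {fpr_b - fpr_a:.3f}")
print(f"precision gap, balanced:   {gap_balanced:.3f}")    # 0.012
print(f"precision gap, unbalanced: {gap_unbalanced:.3f}")  # 0.159
```

This compares only one threshold per classifier, so it illustrates the mechanism rather than reproducing a full PR curve.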