Solved – Why AUC is not a good performance metric for a classification model

Tags: auc, classification, model-evaluation, roc

After understanding the benefits of AUC, I was surprised to learn that in some scenarios it might not be a good performance metric for evaluating a classification model. Below are the two scenarios:

1. Scale invariance is not always desirable:

For example, sometimes we really do need well-calibrated probability outputs, and AUC won't tell us about that (see the sketch after these two scenarios).

2. Classification-threshold invariance is not always desirable:

In cases where there are wide disparities in the cost of false
negatives vs. false positives, it may be critical to minimize one type
of classification error. For example, when doing email spam detection,
you likely want to prioritize minimizing false positives (even if that
results in a significant increase in false negatives). AUC isn't a
useful metric for this type of optimization.
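
As a minimal sketch of the first point (assuming scikit-learn and NumPy; the simulated data is made up for illustration): AUC depends only on the ranking of the scores, so any strictly monotone distortion of well-calibrated probabilities leaves AUC unchanged, while a proper scoring rule such as the Brier score immediately reflects the lost calibration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)

# Well-calibrated scores by construction: labels are drawn from the scores themselves
p_calibrated = rng.uniform(0.05, 0.95, size=5000)
y_true = rng.binomial(1, p_calibrated)

# Monotone distortion: same ranking (so same AUC), but badly miscalibrated
p_miscalibrated = p_calibrated ** 4

print("AUC   calibrated:   ", roc_auc_score(y_true, p_calibrated))
print("AUC   miscalibrated:", roc_auc_score(y_true, p_miscalibrated))     # identical
print("Brier calibrated:   ", brier_score_loss(y_true, p_calibrated))
print("Brier miscalibrated:", brier_score_loss(y_true, p_miscalibrated))  # much worse
```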

Questions:

  • Could anyone explain, with an example, what well-calibrated probability outputs of a model are, and how AUC fails to evaluate a model in this respect?

  • Could anyone suggest a good metric for the second scenario?

Best Answer

First a little disclaimer: I don't have the academic credentials to back up anything I'm saying here; this is just what I use in practice.

There's a metric called Youden's index, which is:

$$Y = -1 + \text{sensitivity} + \text{specificity}$$

If $Y = 0$, your classification system is no better than random guessing; if $Y = 1$, it is a perfect classification system.
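
As a quick sketch (the helper name and the example values are just for illustration):

```python
def youden_index(sensitivity, specificity):
    """Plain Youden's index: Y = -1 + sensitivity + specificity."""
    return -1 + sensitivity + specificity

print(youden_index(0.5, 0.5))   # 0.0 -> no better than random guessing
print(youden_index(1.0, 1.0))   # 1.0 -> perfect classification
print(youden_index(0.8, 0.9))   # 0.7
```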

It is possible to favor sensitivity over specificity or vice-versa by adding weights:

$$Y = -1 + \text{sensitivity} \cdot 2w + \text{specificity} \cdot 2(1 - w)$$
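
As a rough sketch of how this weighted index could be computed from a model's scores and swept over decision thresholds (assumes NumPy; the toy labels, scores, and the choice $w = 0.3$ are made up for illustration):

```python
import numpy as np

def weighted_youden(y_true, scores, threshold, w=0.5):
    """Weighted Youden's index: Y = -1 + sensitivity*2*w + specificity*2*(1-w)."""
    pred = scores >= threshold
    pos, neg = (y_true == 1), (y_true == 0)
    sensitivity = (pred & pos).sum() / pos.sum()
    specificity = (~pred & neg).sum() / neg.sum()
    return -1 + sensitivity * 2 * w + specificity * 2 * (1 - w)

# Toy data; w < 0.5 favors specificity (e.g. when false positives are expensive)
rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.4, size=500)
scores = np.clip(y_true * 0.3 + rng.uniform(0.0, 0.7, size=500), 0.0, 1.0)

thresholds = np.linspace(0.05, 0.95, 19)
ys = [weighted_youden(y_true, scores, t, w=0.3) for t in thresholds]

print("best threshold:", thresholds[int(np.argmax(ys))])
print("summed Y over the sweep:", float(np.sum(ys)))   # a crude 'area under the Y curve'
```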

If your classifier is detecting spam and the cost of a false positive is high, you want as many true negatives as possible and thus favor specificity over sensitivity. By adding weights you get an index of how good your classification system is, given the associated costs of wrong classifications.

You can also plot a $Y$ curve (even a multi-dimensional one) if your classification system has tunable parameters, and then calculate the area under the curve (or the volume, in the case of two parameters), or simply sum up the $Y$ values for each combination of parameters. This can easily be extended to multiple classes:

$$Y = \frac{-N + \sum_{i}\left(\text{sensitivity}_{i} \cdot 2 w_{i} + \text{specificity}_{i} \cdot 2 (1-w_{i})\right)}{N}$$

I use this to compare neural networks that distinguish between multiple classes, where correctly classifying a few of the classes matters much more than the rest (e.g. recognizing a stop sign is far more important than recognizing a sign that tells you at which times of day you're allowed to park). The weights let me configure how important it is to recognize something (sensitivity) and how important it is to recognize that something isn't that thing (specificity); for example, I want high specificity on the parking sign, where sensitivity isn't that important, and high sensitivity on the stop sign.
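
A sketch of the multi-class version, using one-vs-rest sensitivity and specificity per class (the sign labels and weight values here are hypothetical):

```python
import numpy as np

def multiclass_weighted_youden(y_true, y_pred, weights):
    """Y = (-N + sum_i [sens_i*2*w_i + spec_i*2*(1-w_i)]) / N over the classes in `weights`."""
    total, n_classes = 0.0, len(weights)
    for cls, w in weights.items():
        pos, neg = (y_true == cls), (y_true != cls)
        pred_pos = (y_pred == cls)
        sensitivity = (pred_pos & pos).sum() / pos.sum()
        specificity = (~pred_pos & neg).sum() / neg.sum()
        total += sensitivity * 2 * w + specificity * 2 * (1 - w)
    return (total - n_classes) / n_classes

# High sensitivity weight on the stop sign, low on the parking sign
# (i.e. for the parking sign, specificity matters more)
y_true = np.array(["stop", "park", "stop", "speed", "park", "stop", "speed"])
y_pred = np.array(["stop", "stop", "stop", "speed", "park", "park", "speed"])
weights = {"stop": 0.9, "park": 0.2, "speed": 0.5}

print(multiclass_weighted_youden(y_true, y_pred, weights))
```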
