Solved – Is it possible for a model to have higher sensitivity/specificity but lower accuracy and AUC

auc, model-evaluation, sensitivity-specificity

In the evaluation of classification models, I've found one model to have a higher accuracy and c-statistic (AUC) as compared to a second model. However, the second model has higher sensitivity, specificity, positive predictive value, and negative predictive value. Is this mathematically possible?

Best Answer

Definitions

Suppose this is a binary classification task, where your model estimates $\mathbb{P}(y_i=1 \mid x_i)$ for $y_i\in\{0,1\}$ and $x_i \in \mathbb{R}^p$.

Sensitivity and specificity characterize the true positive rate and true negative rate at some threshold $t$. This means that if you choose a different $t$, you'll have a different sensitivity & specificity.
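
For instance, here is a minimal sketch of computing both quantities at a given threshold (the labels, scores, and the `sensitivity_specificity` helper are hypothetical, just for illustration):

```python
import numpy as np

def sensitivity_specificity(y_true, scores, t):
    """Sensitivity (TPR) and specificity (TNR) at threshold t."""
    y_pred = (scores >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels and model scores
y = np.array([1, 1, 1, 0, 0, 0])
p = np.array([0.9, 0.6, 0.4, 0.5, 0.3, 0.1])

print(sensitivity_specificity(y, p, t=0.5))  # approx (0.667, 0.667)
print(sensitivity_specificity(y, p, t=0.7))  # approx (0.333, 1.0)
```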

The $c$-statistic is also known as the area under the ROC curve. The ROC curve plots the true positive rate on the vertical axis and the false positive rate on the horizontal axis. In other words, each point on the ROC curve is the pair $(1 - \text{specificity}(t), \text{sensitivity}(t))$ for some threshold $t$. More compactly, the curve is traced out by the tuples $(\text{FPR}(t), \text{TPR}(t))$ as $t$ varies, which emphasizes the dependence on $t$. A useful property of the $c$-statistic is that it estimates the probability that a randomly-selected positive has a higher score than a randomly-selected negative.
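
That last interpretation is easy to check numerically: a brute-force count of positive–negative pairs matches `sklearn.metrics.roc_auc_score`. A sketch, assuming synthetic labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)              # synthetic labels
scores = rng.uniform(size=200) + 0.3 * y      # positives tend to score higher

# Fraction of (positive, negative) pairs where the positive outranks
# the negative, counting ties as half
pos, neg = scores[y == 1], scores[y == 0]
diffs = pos[:, None] - neg[None, :]
pairwise = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

print(pairwise, roc_auc_score(y, scores))     # the two values agree
```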

Accuracy is the fraction of correctly-classified instances at some threshold $t$, so accuracy likewise varies with the choice of $t$! In the binary case, it's common for people to pick $t=0.5$, but this choice is exactly that: arbitrary.
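
A quick sketch of that dependence, again with hypothetical labels and scores; the reported accuracy changes as $t$ moves:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and scores
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p = np.array([0.95, 0.8, 0.7, 0.45, 0.6, 0.4, 0.2, 0.05])

for t in (0.1, 0.5, 0.7):
    acc = accuracy_score(y, (p >= t).astype(int))
    print(f"t={t}: accuracy={acc:.3f}")  # 0.625, 0.750, 0.875
```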

Inferences

These statistics measure different things, so it is not necessarily surprising that one model can have a better score in one respect and a lower score in another.

For some toy examples, the effect can be wildly counter-intuitive. Suppose that your sample is balanced and your model scores every positive at 0.49 and every negative at 0.48. All positives are ranked higher than all negatives, so the $c$-statistic (ROC AUC) is 1.0. But the accuracy at $t=0.5$ is 0.5, because the sample is balanced and only the negatives are correctly classified. Moreover, if you change the class composition (while the scores for each class stay the same), you can change the accuracy arbitrarily, but the AUC will still be 1.0!
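
This example is easy to reproduce; a sketch using scikit-learn, with a balanced sample of 50 positives and 50 negatives as one assumed instantiation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# Every positive scores 0.49, every negative scores 0.48
y = np.array([1] * 50 + [0] * 50)
p = np.array([0.49] * 50 + [0.48] * 50)

print(roc_auc_score(y, p))                        # 1.0 -- perfect ranking
print(accuracy_score(y, (p >= 0.5).astype(int)))  # 0.5 -- every positive missed
```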

Moreover, the sensitivity and specificity statistics just characterize performance at single choices of threshold. Different choices of threshold achieve different trade-offs, so they might be preferable for some particular circumstance.
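
Reusing the hypothetical scores from the accuracy sketch above, a short threshold sweep makes the trade-off explicit: raising $t$ buys specificity at the cost of sensitivity:

```python
import numpy as np

# Same hypothetical labels and scores as in the accuracy example
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p = np.array([0.95, 0.8, 0.7, 0.45, 0.6, 0.4, 0.2, 0.05])

for t in (0.3, 0.5, 0.7, 0.9):
    pred = (p >= t).astype(int)
    sens = np.mean(pred[y == 1] == 1)
    spec = np.mean(pred[y == 0] == 0)
    print(f"t={t}: sens={sens:.2f} spec={spec:.2f}")
```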

Experimental Results

Model 1 has a higher accuracy at $t_0$ and a higher $c$-statistic than Model 2.

Model 2 has a higher PPV, NPV, sensitivity, and specificity at $t_1$ than Model 1.

Is $t_0 = t_1$? That information is not stated. But we do know that the ROC curve for Model 2 passes through the point given by its sensitivity and 1 - specificity at the threshold $t_1$. On the other hand, the total area under that ROC curve is smaller than for Model 1. How is this possible? Just draw monotonic curves passing through the three points we know must be on the ROC curve: $(0,0)$, $(1,1)$, and $(\text{FPR}(t_1), \text{TPR}(t_1))$. You can make the curve have a large or small area, depending on your choice, as the sketch below demonstrates.
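
To make that concrete, here is one hypothetical construction: two models that share the exact same operating point at $t=0.5$ (sensitivity and specificity both 0.8, hence identical accuracy, PPV, and NPV there) yet have very different areas under their ROC curves:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Ten positives followed by ten negatives
y = np.array([1] * 10 + [0] * 10)

# Model A ranks well everywhere; Model B only looks good at t = 0.5
model_a = np.array([0.9] * 8 + [0.4] * 2 + [0.1] * 8 + [0.6] * 2)
model_b = np.array([0.55] * 8 + [0.1] * 2 + [0.45] * 8 + [0.9] * 2)

for name, p in (("A", model_a), ("B", model_b)):
    pred = (p >= 0.5).astype(int)
    sens = np.mean(pred[y == 1] == 1)
    spec = np.mean(pred[y == 0] == 0)
    print(name, f"sens={sens:.2f} spec={spec:.2f} auc={roc_auc_score(y, p):.2f}")

# A sens=0.80 spec=0.80 auc=0.96
# B sens=0.80 spec=0.80 auc=0.64
```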

What does all of this mean? You'll have to make a choice about what kind of trade-offs you're willing to accept. Do you care about sensitivity and specificity at $t_1$? At $t_0$? Or would you prefer the TPR and FPR at a different value $t$ altogether? I can't answer that.
