Solved – ROC curve drawbacks

Tags: logistic, roc

In class yesterday, we were taught about logistic regression and subsequently the ROC curve and how to use it.

My questions are:

  1. Is this the most common way to determine whether the logistic model is the best one? If not, what are other common methods?

  2. What are the possible drawbacks of using the ROC curve to judge whether or not to use the model?

Best Answer

ROC AUC has the property that it coincides with the $c$ statistic. The $c$ statistic measures the probability that a positive example is ranked higher than a negative example. In this sense, the ROC AUC answers the question of how well the model discriminates between the two classes.
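As a minimal sketch of that equivalence (NumPy and scikit-learn assumed, with made-up scores and labels), the pairwise concordance probability and `roc_auc_score` give the same number:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)            # binary labels
scores = rng.normal(loc=y, scale=1.0)       # positives score higher on average

# c statistic: P(score of a random positive > score of a random negative),
# counting ties as 1/2
pos, neg = scores[y == 1], scores[y == 0]
diffs = pos[:, None] - neg[None, :]         # all positive/negative pairs
c_stat = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

print(c_stat, roc_auc_score(y, scores))     # the two numbers agree
```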

A model with high discrimination is not necessarily well calibrated. Suppose a logistic regression model predicts a probability of 0.52 for every positive and 0.51 for every negative. This model has an AUC of 1, yet the probabilities are not helpful for identifying which purported positives are the highest-risk: because every positive is assigned the same posterior probability, they cannot be differentiated, and the predicted risks themselves barely differ from a coin flip.
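A toy numerical check of that 0.52 / 0.51 example (scikit-learn assumed): the ranking is perfect, so AUC is 1, but the log loss shows the probabilities are barely better than predicting 0.5 for everyone.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y = np.array([1] * 50 + [0] * 50)
p = np.array([0.52] * 50 + [0.51] * 50)

print(roc_auc_score(y, p))   # 1.0 -- perfect discrimination
print(log_loss(y, p))        # ~0.68, vs. ~0.69 for predicting 0.5 everywhere
```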

Moreover, a well-calibrated model, one whose posterior probabilities match the true probabilities, has its maximum attainable ROC AUC fixed by how much the true risks of the two classes overlap in the data. This means that a model with very desirable probabilities has a cap on its performance, and an uncalibrated model could therefore "dominate" it in terms of ROC AUC.
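A hedged simulation sketch of that cap (the Beta(2, 2) risk distribution and all numbers here are assumptions for illustration, not from the answer above): even an oracle that predicts the true probability for every observation ends up with an AUC well below 1, simply because many positives and negatives share similar true risks.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
true_p = rng.beta(2, 2, size=100_000)   # assumed true risk of each observation
y = rng.binomial(1, true_p)             # labels drawn from those risks

# The oracle predicts exactly the true probabilities, yet its AUC is capped
# by the overlap of the two classes' risks (roughly 0.76 for this setup).
print(roc_auc_score(y, true_p))
```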

ROC AUC doesn't tell you anything about the costs of different kinds of errors. For example, if you're trying to detect fraud, a 10,000 dollar purchase of uncertain provenance represents a much larger potential loss than a 10 dollar purchase, but ROC AUC treats both events as if they carry the same weight; any reasonable assessment of the model should distinguish between these two kinds of error.
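A deliberately crude sketch of that point (made-up transactions and a hypothetical 0.5 flagging threshold, not a recommended metric): AUC sees only the ranking of scores against labels, while a dollar-weighted view sees which errors are expensive.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y      = np.array([1,      1,   0,      0])    # 1 = fraud
amount = np.array([10_000, 10,  10_000, 10])   # transaction value in dollars
scores = np.array([0.2,    0.9, 0.1,    0.3])  # model's fraud scores

# AUC ignores `amount` entirely: the $10,000 missed fraud and a $10 missed
# fraud would count the same.
print(roc_auc_score(y, scores))   # 0.75

# Dollar-weighted view: flag scores above a threshold and total the losses
# from frauds the model failed to flag.
flag = scores >= 0.5
missed_loss = amount[(y == 1) & ~flag].sum()
print(missed_loss)                # 10000 -- the expensive fraud was missed
```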

ROC AUC also tends to be dominated by the "high FPR" points. Depending on the application, these may be the least relevant points. Consider the case where the model is used to refer high-risk transactions to experts who conduct further vetting, and there are only enough humans to assess 50 transactions per unit time. The most highly-ranked transactions occur on the "left hand" side of the ROC curve by definition, and this is also the region contributing the least area. So by looking at the whole AUC, you're optimistically biasing your results upwards: ROC AUC is buoyed by the observations "to the right" of the set of observations that humans will actually vet. (The illustration is simple. Draw a vertical line at any FPR < 0.5 on a ROC curve; the area to the right of the line is larger than the area to the left for all such vertical lines.)

To avoid this, some people use partial ROC AUC, which has its own host of problems, chief among them that software implementations tend to assume you're interested in truncating at some fixed value of FPR. But if what you care about is the top $n$ transactions, that approach is wrong, because the top $n$ transactions fall at different FPR values for different classifiers. Standardizing the partial AUC (to preserve the interpretation that 0.5 is no better than random and 1 is perfect) incurs further difficulties. A sketch of both issues follows below.
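A minimal sketch, assuming scikit-learn and simulated scores: `roc_auc_score` exposes a `max_fpr` argument that returns the (McClish-)standardized partial AUC truncated at a fixed FPR, whereas a "review the top $n$" budget lands at a different FPR for every classifier, so fixed-FPR truncation answers a different question.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1_000)
scores = rng.normal(loc=y, scale=1.5)

print(roc_auc_score(y, scores))                # full AUC
print(roc_auc_score(y, scores, max_fpr=0.1))   # standardized partial AUC up to FPR = 0.1

# The FPR actually reached by vetting the top n = 50 scores depends on this
# particular classifier's ranking, not on any pre-chosen FPR cutoff.
top = np.argsort(scores)[::-1][:50]
fpr_at_top_n = np.sum(y[top] == 0) / np.sum(y == 0)
print(fpr_at_top_n)
```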

The ROC curve itself is of little interest. "Dominating" classifiers can be assessed by AUC; stochastic equivalence can be assessed by tests of equivalence of ranks. Prof. Harrell's comment drives at a consistent theme of his work, which is that the real question diagnostics should answer is one of risk assessment and utility optimization. Examining the ROC curve tends to encourage choosing a classification cutoff, which should be avoided because a cutoff hands decision makers only partial information compared with the underlying risk estimate.

Alternative measures of performance, such as the log-likelihood (equivalently, the log loss) or the Brier score, characterize the calibration of the model; these proper scoring rules have the property that, in expectation, they are optimized by honest probability forecasts.
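A short closing sketch (scikit-learn assumed, reusing the toy 0.52 / 0.51 model from above plus a hypothetical sharper model): the proper scoring rules separate the two models, while AUC cannot tell them apart.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss

y = np.array([1] * 50 + [0] * 50)
p_flat  = np.array([0.52] * 50 + [0.51] * 50)   # perfect ranking, uninformative probabilities
p_sharp = np.array([0.90] * 50 + [0.10] * 50)   # same ranking, probabilities close to the outcomes

for p in (p_flat, p_sharp):
    print(roc_auc_score(y, p), log_loss(y, p), brier_score_loss(y, p))
# AUC is 1.0 in both cases; log loss and Brier score improve only for the sharper,
# better-calibrated forecasts.
```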