Solved – When is an AUC score misleadingly high?

Tags: auc, metric, roc

I have an algorithm that gives an AUC (area under the receiver operating characteristic curve) of 0.94.

I mean, this is amazing, but… probably too amazing, considering the difficulty of the task I am working on. So how can I tell if the AUC is valid, or misleadingly high?

(P.S. yes, I am training on the training set and testing on a completely separate test set.)

Best Answer

One possible reason you can get a high AUROC with what some might consider a mediocre prediction is imbalanced data (heavily skewed toward the "zero" class) combined with high recall and low precision. That is, most of the ones sit at the high end of your predicted probabilities, but most of the cases at that high end are still zeroes. This works because the ROC curve gets most of its "lift" early in the plot, i.e., while only a small fraction of the zeroes have been passed.

For example, if 5% of the test set are "ones" and all of the ones appear in the top 10% of your predictions, then your AUC will be at least 18/19 ≈ 0.947, even if the top 5% of predictions are all zeroes. In that worst case, the true positive rate reaches 100% while the false positive rate is still only 1/19 (only the zeroes sitting in the top 10% have been passed), so at least 18/19 of the area under the curve is already locked in.
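As a sanity check, here is a minimal sketch of that worst case (the 1000-example size and the exact placement of the ones are made up for illustration; sklearn's roc_auc_score is assumed):

import numpy as np
from sklearn.metrics import roc_auc_score

# 1000 examples, 5% ones; the top 5% of scores are all zeroes and the
# ones occupy the next 5%, so every one is inside the top 10%.
yTrue = np.zeros(1000, dtype=int)
yTrue[50:100] = 1
yScore = np.linspace(1.0, 0.0, num=1000)   # strictly decreasing scores

roc_auc_score(yTrue, yScore)   # ~0.947, i.e. 18/19

Even with every one deliberately pushed below a block of zeroes, the score stays above 0.94.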

A simple Python example:

import numpy as np
import sklearn.metrics

# Imbalanced labels: only 4 of the 31 test cases are ones
yTest = [0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# Strictly decreasing scores, so the ones all sit near the top of the ranking
yPredicted = np.linspace(0.9, 0.1, num=len(yTest))
sklearn.metrics.roc_auc_score(yTest, yPredicted)  # ~0.89

import matplotlib.pyplot as plt
fpr, tpr, threshold = sklearn.metrics.roc_curve(yTest, yPredicted)
plt.plot(fpr, tpr)
plt.show()

[ROC curve produced by the Python code sample]
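For contrast, a precision-oriented summary of the very same ranking is far less flattering. A minimal sketch, assuming sklearn's average_precision_score (which summarizes the precision-recall curve) and reusing the data from above:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

yTest = [0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
yPredicted = np.linspace(0.9, 0.1, num=len(yTest))

roc_auc_score(yTest, yPredicted)            # ~0.89
average_precision_score(yTest, yPredicted)  # ~0.44, much harsher than the ROC view

With only 4 ones among 31 cases, the precision-recall score is dragged down by the zeroes interleaved near the top of the ranking, which is exactly the high-recall, low-precision situation described above.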

Whether this is a "bad" prediction depends on your priorities. If you think that false negatives are terrible and false positives are tolerable, then this prediction is okay. But if it's the opposite, then this prediction is pretty bad.
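To make that trade-off concrete, here is a rough sketch of the confusion matrix for the example above at a hand-picked cutoff (0.68 is arbitrary, chosen only so that the top nine scores get flagged as positive):

import numpy as np
from sklearn.metrics import confusion_matrix

yTest = [0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
yPredicted = np.linspace(0.9, 0.1, num=len(yTest))

# Flag the top nine scores as positive (cutoff picked by hand for illustration)
yHat = (yPredicted >= 0.68).astype(int)

# Rows = true class, columns = predicted class:
# [[TN FP]    [[22  5]
#  [FN TP]] =  [ 0  4]]
confusion_matrix(yTest, yHat)

Zero false negatives but five false positives against only four true positives: fine if misses are what you fear, poor if false alarms are what you pay for.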