Solved – How to derive the probabilistic interpretation of the AUC

aucprobabilityroc

Why is the area under the ROC curve the probability that a classifier will rank a randomly chosen "positive" instance (from the retrieved predictions) higher than a randomly chosen "positive" one (from the original positive class)?
How does one prove this statement mathematically using integral, giving the CDFs and PDFs of the true positive and negative class distributions?

Best Answer

First thing, let's try to define the area under the ROC curve formally. Some assumptions and definitions:

  • We have a probabilistic classifier that outputs a "score" s(x), where x are the features, and s is a generic increasing monotonic function of the estimated probability p(class = 1|x).

  • $f_{k}(s)$, with $k = \{0, 1\}$ := pdf of the scores for class k, with CDF $F_{k}(s)$

  • The classification of a new observation is obtained compraing the score s to a threshold t

Furthermore, for mathematical convenience, let's consider the positive class (event detected) k = 0, and negative k = 1. In this setting we can define:

  • Recall (aka Sensitivity, aka TPR): $F_{0}(t)$ (proportion of positive cases classified as positive)
  • Specificity (aka TNR): $1 - F_{1}(t)$ (proportion of negative cases classified as negative)
  • FPR (aka Fall-out): 1 - TNR = $F_{1}(t)$

The ROC curve is then a plot of $F_{0}(t)$ against $F_{1}(t)$. Setting $v = F_1(s)$, we can formally define the area under the ROC curve as: $$AUC =\int_{0}^{1} F_{0}(F_{1}^{-1}(v)) dv$$ Changing variable ($dv = f_{1}(s)ds$): $$AUC =\int_{ - \infty}^{\infty} F_{0}(s) f_{1}(s)ds$$

This formula can easiliy be seen to be the probability that a randomly drawn member of class 0 will produce a score lower than the score of a randomly drawn member of class 1.

This proof is taken from: https://pdfs.semanticscholar.org/1fcb/f15898db36990f651c1e5cdc0b405855de2c.pdf