Solved – Did I just invent a Bayesian method for the analysis of ROC curves?

auc, bayesian, machine-learning, roc, sampling

Preamble

This is a long post. If you're re-reading this, please note that I've revised the question portion, though the background material remains the same. Additionally, I believe that I've devised a solution to the problem. That solution appears at the bottom of the post. Thanks to CliffAB for pointing out that my original solution (edited out of this post; see edit history for that solution) necessarily produced biased estimates.

Problem

In machine learning classification problems, one way to assess model performance is by comparing ROC curves, or area under the ROC curve (AUC). However, it's my observation that there's precious little discussion of the variability of ROC curves or estimates of AUC; that is, they're statistics estimated from data, and so have some error associated with them. Characterizing the error in these estimates will help characterize, for example, whether one classifier is, indeed, superior to another.

I've developed the following approach, which I call Bayesian analysis of ROC curves, to address this problem. There are two key observations in my thinking about the problem:

  1. ROC curves are composed of quantities estimated from the data, and so are amenable to Bayesian analysis.

    The ROC curve is constructed by plotting the true positive rate $TPR(\theta)$ against the false positive rate $FPR(\theta)$, each of which is itself estimated from the data. I consider $TPR$ and $FPR$ to be functions of $\theta$, the decision threshold used to sort class A from class B (tree votes in a random forest, distance from a hyperplane in an SVM, predicted probabilities in a logistic regression, etc.). Varying the decision threshold $\theta$ returns different estimates of $TPR$ and $FPR$. Moreover, we can consider $TPR(\theta)$ to be an estimate of the success probability in a sequence of Bernoulli trials. In fact, TPR is defined as $\frac{TP}{TP+FN}$, which is also the MLE of the binomial success probability in an experiment with $TP$ successes and $TP+FN>0$ total trials.

    So by considering the outputs of $TPR(\theta)$ and $FPR(\theta)$ to be random variables, we are faced with the problem of estimating the success probability of a binomial experiment in which the numbers of successes and failures are known exactly (given by $TP$, $FP$, $FN$ and $TN$, which I assume are all fixed). Conventionally, one simply uses the MLE and assumes that TPR and FPR are fixed for specific values of $\theta$. But in my Bayesian analysis of ROC curves, I instead draw posterior simulations of ROC curves by sampling from the posterior distribution over ROC curves. A standard Bayesian model for this problem is a binomial likelihood with a beta prior on the success probability; the posterior on the success probability is then also beta, so for each $\theta$ we have a posterior distribution of TPR and FPR values (a sketch of this per-threshold sampling appears after this list). This brings us to my second observation.

  2. ROC curves are non-decreasing. So once one has sampled some value of $TPR(\theta)$ and $FPR(\theta)$, there is zero probability of sampling a point in ROC space "southeast" of the sampled point. But shape-constrained sampling is a hard problem.
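As a concrete illustration of the first observation, here is a minimal sketch in Python. The function names, the score/label representation, and the handling of the degenerate posteriors are my own assumptions for illustration, not part of the derivation above.

```python
import numpy as np

def beta_or_point(a, b, n, rng):
    """Posterior draws from Beta(a, b) under the improper Beta(0, 0) prior;
    a zero parameter degenerates the posterior to a point mass at 0 or 1."""
    if a == 0:
        return np.zeros(n)
    if b == 0:
        return np.ones(n)
    return rng.beta(a, b, size=n)

def posterior_rate_samples(scores, labels, n_draws=2000, rng=None):
    """For each decision threshold theta, draw posterior samples of
    TPR(theta) and FPR(theta) from the per-threshold beta posteriors."""
    rng = np.random.default_rng() if rng is None else rng
    thresholds = np.unique(scores)
    pos, neg = labels == 1, labels == 0
    tpr_draws, fpr_draws = [], []
    for t in thresholds:
        pred = scores >= t  # classify as positive at this threshold
        tp, fn = np.sum(pred & pos), np.sum(~pred & pos)
        fp, tn = np.sum(pred & neg), np.sum(~pred & neg)
        tpr_draws.append(beta_or_point(tp, fn, n_draws, rng))
        fpr_draws.append(beta_or_point(fp, tn, n_draws, rng))
    return thresholds, np.array(tpr_draws), np.array(fpr_draws)
```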

The Bayesian approach can be used to simulate a large number of AUCs from a single set of estimates. For example, 20 simulations look like this compared to the original data:
Simulated ROC curves

This method has a number of advantages. For example, the probability that the AUC of one model is greater than that of another can be estimated directly by comparing the AUCs of their posterior simulations. Estimates of variance can be obtained via simulation, which is cheaper than resampling, and these estimates do not suffer from the correlated samples that resampling methods produce.
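For instance, given arrays of posterior AUC draws for two models (the function name and interface are my own, purely illustrative), the head-to-head comparison is a one-liner:

```python
import numpy as np

def prob_a_beats_b(auc_a, auc_b):
    """Estimate Pr(AUC_A > AUC_B) as the fraction of paired posterior
    draws in which model A's simulated AUC exceeds model B's."""
    return np.mean(np.asarray(auc_a) > np.asarray(auc_b))
```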

Solution

I developed a solution to this problem by making a third and fourth observation about the nature of the problem, in addition to the two above.

  3. $TPR(\theta)$ and $FPR(\theta)$ have marginal densities which are amenable to simulation.

    If $TPR(\theta)$ (respectively $FPR(\theta)$) is a beta-distributed random variable with parameters $TP$ and $FN$ (respectively $FP$ and $TN$), we can also consider the density of TPR averaged over the several values of $\theta$ that arise in our analysis. That is, we can consider a hierarchical process in which one samples a value $\tilde{\theta}$ from the collection of $\theta$ values obtained from our out-of-sample model predictions, and then samples a value of $TPR(\tilde{\theta})$. The distribution of the resulting $TPR(\tilde{\theta})$ samples is a density for the true positive rate that is unconditional on $\theta$ itself. Because we assume a beta model for $TPR(\theta)$, the resulting distribution is a mixture of beta distributions, with a number of components $c$ equal to the size of our collection of $\theta$ values and mixture coefficients $1/c$.

    In this example, I obtained the following CDF of TPR. Notably, due to the degeneracy of beta distributions in which one of the parameters is zero, some of the mixture components are Dirac delta functions at 0 or 1; this is what causes the sudden spikes at 0 and 1. These spikes imply that the densities are neither continuous nor discrete. A prior which is positive in both parameters would have the effect of smoothing these spikes (not shown), but the resulting ROC curves would be pulled towards the prior. The same construction can be carried out for FPR (not shown). Drawing samples from the marginal densities is a simple application of inverse transform sampling; a sketch of an equivalent hierarchical sampler follows the figure below.

CDF of TPR
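Here is a sketch of the hierarchical sampler just described, which is equivalent in distribution to inverse transform sampling from the mixture CDF. The array-of-counts interface is my own assumption: `successes[i]` and `failures[i]` would be, e.g., $TP$ and $FN$ at the $i$-th threshold.

```python
import numpy as np

def sample_marginal(successes, failures, n_draws, rng=None):
    """Hierarchical draw from the equal-weight beta mixture: pick a
    threshold index uniformly (mixture coefficients 1/c), then draw from
    Beta(successes[i], failures[i]). Components with a zero parameter are
    point masses at 0 or 1, producing the spikes seen in the CDF above."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(len(successes), size=n_draws)
    draws = np.empty(n_draws)
    for j, i in enumerate(idx):
        a, b = successes[i], failures[i]
        if a == 0:
            draws[j] = 0.0
        elif b == 0:
            draws[j] = 1.0
        else:
            draws[j] = rng.beta(a, b)
    return draws
```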

  4. To satisfy the shape constraint, we just have to sort the TPR and FPR samples independently.

    The non-decreasing requirement is equivalent to pairing the independently sorted marginal samples of TPR and FPR: the shape of the ROC curve is completely determined by matching the smallest TPR value with the smallest FPR value, the second smallest with the second smallest, and so on, so the construction of a shape-constrained random sample is trivial here (see the sketch after the figure below). For the improper $\text{Beta}(0,0)$ prior, simulations provide evidence that constructing ROC curves in this manner produces samples whose mean AUC converges to the original AUC as the number of samples grows. Below is a KDE of 2000 simulations.

KDE of 2000 simulated AUC values
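A sketch of the full construction, assuming the hypothetical `sample_marginal` helper above: draw equal-length marginal samples of TPR and FPR, sort each independently, and integrate with the trapezoidal rule.

```python
import numpy as np

def sample_roc_auc(tpr_draws, fpr_draws):
    """One posterior ROC draw: sorting the marginal TPR and FPR samples
    independently and pairing them rank-for-rank enforces the
    non-decreasing shape constraint; AUC then follows by trapezoids."""
    tpr = np.sort(np.concatenate(([0.0], tpr_draws, [1.0])))  # anchor at corners
    fpr = np.sort(np.concatenate(([0.0], fpr_draws, [1.0])))
    return np.trapz(tpr, fpr)

# e.g., 2000 simulated AUCs, using c marginal draws per curve:
# aucs = [sample_roc_auc(sample_marginal(tp, fn, c), sample_marginal(fp, tn, c))
#         for _ in range(2000)]
```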

Comparison to the Bootstrap

In a lengthy chat discussion with @AdamO (thanks, AdamO!), he pointed out that there are several established methods for comparing two ROC curves, or for characterizing the variability of a single ROC curve, among them the bootstrap. So as an experiment, I tried bootstrapping my example, which has $n=20$ observations in the holdout set, and comparing the results to the Bayesian method. The results are compared below. (The bootstrap implementation here is the simple bootstrap: randomly sampling with replacement up to the size of the original sample. Cursory reading on bootstrapping exposes significant gaps in my knowledge of resampling methods, so perhaps this is not an appropriate approach.)
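For reference, a minimal version of the simple bootstrap used here, sketched with scikit-learn's `roc_auc_score`; the variable names and the handling of one-class resamples are my own choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(scores, labels, n_boot=2000, rng=None):
    """Simple bootstrap: resample (score, label) pairs with replacement
    to the original sample size and recompute the AUC each time."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(n, size=n)
        if len(np.unique(labels[idx])) < 2:
            continue  # AUC is undefined when a resample contains one class
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    return np.array(aucs)
```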

Comparison of bootstrap and Bayesian AUC distributions

This demonstration shows that the mean of the bootstrap distribution is biased below the mean of the original sample, and that the KDE of the bootstrap yields well-defined "humps". The genesis of these humps is hardly mysterious: the ROC curve is sensitive to the inclusion of each point, and the effect of a small sample (here, $n=20$) is that the underlying statistic is all the more sensitive to each point's inclusion. (Emphatically, this patterning is not an artifact of the kernel bandwidth; note the rug plot. Each stripe represents several bootstrap replicates with the same value. The bootstrap has 2000 replicates, but the number of distinct values is clearly much smaller. We can conclude that the humps are an intrinsic feature of the bootstrap procedure.) By contrast, mean Bayesian AUC estimates tend to be very near the original estimate and "smooth out" the humps that the bootstrap exhibits.

Question

My revised question is whether this revised solution is incorrect. A good answer will prove (or disprove) that the resulting samples of ROC curves are biased, or will likewise prove or disprove other qualities of this approach.

Best Answer

First off, there is no accepted way to "analyze" a ROC curve: it is merely a graphic that portrays the predictive ability of a classification model. You can certainly summarize a ROC curve with a $c$-statistic or the AUC, and calculating confidence intervals and performing inference with $c$-statistics is well understood due to their relation to the Wilcoxon U-statistic.

It's generally fairly well accepted that you can estimate the variability of ROC curves using the bootstrap (cf. Pepe, Etzioni, and Feng). This is a nice approach because the ROC curve is an empirical estimate and the bootstrap is non-parametric. Parameterizing anything in such a fashion introduces assumptions and complications, such as "is a flat prior really noninformative?" I am not convinced that it is here.

Lastly, there's the issue of pseudo-likelihood. You have induced variability in the ROC curves by putting a prior on $\theta$, which, in all of ROC usage, is the only thing that is typically not considered a random variable. You have then assumed that the variability in TPR and the variability in FPR induced by varying $\theta$ are independent. They are not; in fact, they are completely dependent. It is rather like calculating a Bayesian posterior for your own weight in kilograms and in pounds and claiming that the two do not depend on each other.

Take, as an example, a model with perfect discrimination. Using your method, you will find that the confidence bands are the unit square. They are not! There is no variability in a model with perfect discrimination. A bootstrap will show you that.

If one were to approach the issue of ROC "analysis" from a Bayesian perspective, it would perhaps be most useful to address the problem of model selection by putting a prior on the space of models used for analysis. That would be a very interesting problem.