How to Visualize the Calibration of the Predicted Probability of a Model

binary-data, calibration, classification, data-visualization, predictive-models

Suppose I have a predictive model that produces, for each instance, a probability for each class. Now I recognize that there are many ways to evaluate such a model if I want to use those probabilities for classification (precision, recall, etc.). I also recognize that an ROC curve and the area under it can be used to determine how well the model differentiates between classes. Those are not what I'm asking about.

I'm interested in assessing the calibration of the model. I know that a scoring rule like the Brier score can be useful for this task. That's OK, and I'll likely incorporate something along those lines, but I'm not sure how intuitive such metrics will be for the lay person. I'm looking for something more visual. I want the person interpreting the results to be able to see whether, when the model predicts that something is 70% likely to happen, it actually happens ~70% of the time, and so on.

I've heard of (but never used) Q-Q plots, and at first I thought that was what I was looking for. However, it seems they are really meant for comparing two probability distributions. That's not directly what I have. I have, for a bunch of instances, my predicted probability and then whether the event actually occurred:

Index    P(Heads)    Actual Result
    1          .4            Heads
    2          .3            Tails
    3          .7            Heads
    4         .65            Tails
  ...         ...              ...

So is a Q-Q plot really what I want, or am I looking for something else? If a Q-Q plot is what I should be using, what is the correct way to transform my data into probability distributions?

I imagine I could sort both columns by predicted probability and then create some bins. Is that the type of thing I should be doing, or am I off in my thinking somewhere? I'm familiar with various discretization techniques, but is there a specific way of discretizing into bins that is standard for this sort of thing?
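
To make the idea concrete, here is roughly what I have in mind (just a rough sketch in R; the 0/1 coding of the outcome and the choice of ten equal-count bins are only for illustration):

# Made-up data in the same shape as the table above:
# a predicted probability and whether the event occurred (1 = Heads, 0 = Tails).
set.seed(1)
p <- runif(200)                      # predicted P(Heads)
y <- rbinom(length(p), 1, p)         # actual result, coded as 0/1

# Sort into ten equal-count bins, then compare the mean prediction
# with the observed frequency of the event in each bin.
bin <- cut(p, quantile(p, seq(0, 1, 0.1)), include.lowest = TRUE)
plot(tapply(p, bin, mean), tapply(y, bin, mean),
     xlab = "Mean predicted probability", ylab = "Observed frequency")
abline(0, 1)                         # points near the diagonal would indicate good calibration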

Best Answer

Your thinking is good.

John Tukey recommended binning by halves: split the data into upper and lower halves, then split those halves, then split the extreme halves recursively. Compared to equal-width binning, this allows visual inspection of tail behavior without devoting too many graphical elements to the bulk of the data (in the middle).

Here is an example (using R) of Tukey's approach. (It's not exactly the same: the implementation below differs a little from his.)

First, let's create some predictions and some outcomes that conform to those predictions:

set.seed(17)
prediction <- rbeta(500, 3/2, 5/2)                  # Predicted probabilities
actual <- rbinom(length(prediction), 1, prediction) # Outcomes drawn with exactly those probabilities
plot(prediction, actual, col="Gray", cex=0.8)

The plot is not very informative, because all the actual values are, of course, either $0$ (did not occur) or $1$ (did occur). (It appears as the background of gray open circles in the first figure below.) This plot needs smoothing. To do so, we bin the data. Function mletter does the splitting-by-halves. Its first argument r is an array of ranks between 1 and n (the second argument). It returns unique (numeric) identifiers for each bin:

mletter <- function(r, n) {
    # Bin labels from repeated halving, working up from the lowest ranks...
    lower <-  2 + floor(log(r/(n+1))/log(2))
    # ...and the analogous labels working down from the highest ranks.
    upper <- -1 - floor(log((n+1-r)/(n+1))/log(2))
    i <- 2*r > n          # Use the top-down labels for the upper half of the data
    lower[i] <- upper[i]
    lower                 # Unique numeric bin identifier for each rank
}
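
For instance (this check is not in the original answer), applying mletter to the ranks 1 through 64 shows how the bin populations halve toward the two tails:

table(mletter(1:64, 64))
# Bin populations, from lowest to highest bin: 1 1 2 4 8 16 16 8 4 2 1 1

The bulk of the data stays in a few large central bins while the tails get progressively smaller ones, which is exactly the behavior described above.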

Using this, we bin both the predictions and the outcomes and average each within each bin. Along the way, we compute bin populations:

classes <- mletter(rank(prediction), length(prediction))
pgroups <- split(prediction, classes)
agroups <- split(actual, classes)
bincounts <- unlist(lapply(pgroups, length)) # Bin populations
x <- unlist(lapply(pgroups, mean))           # Mean predicted values by bin
y <- unlist(lapply(agroups, mean))           # Mean outcome by bin

To symbolize the plot effectively we should make the symbol areas proportional to bin counts. It can be helpful to vary the symbol colors a little, too, whence:

binprop <- bincounts / max(bincounts)     # Relative bin populations
colors <- -log(binprop)/log(2)            # Smaller bins get larger values
colors <- colors - min(colors)            # Shift so the largest bin maps to 0
colors <- hsv(colors / (max(colors)+1))   # Convert to hues in [0, 1)

With these in hand, we now enhance the preceding plot:

abline(0,1, lty=1, col="Gray")                           # Reference line: perfect calibration
points(x,y, pch=19, cex = 3 * sqrt(binprop), col=colors) # Solid colored circles
points(x,y, pch=1, cex = 3 * sqrt(binprop))              # Circle outlines

[Figure: calibration plot for the well-calibrated model. Binned mean outcomes (colored circles, with area proportional to bin count) lie close to the diagonal reference line.]

As an example of a poor prediction, let's change the data:

set.seed(17)
prediction <- rbeta(500, 5/2, 1)
actual <- rbinom(length(prediction), 1, 1/2 + 4*(prediction-1/2)^3) # Outcomes no longer follow the predictions
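
Because the same binning, averaging, and plotting steps are repeated, it may help to collect them in a small function. This is only a sketch, not part of the original answer; the name calibration.plot is made up, and it assumes mletter is already defined as above:

calibration.plot <- function(prediction, actual) {
    # Bin by halves of the ranks, then average predictions and outcomes per bin
    classes <- mletter(rank(prediction), length(prediction))
    pgroups <- split(prediction, classes)
    agroups <- split(actual, classes)
    bincounts <- unlist(lapply(pgroups, length))
    x <- unlist(lapply(pgroups, mean))
    y <- unlist(lapply(agroups, mean))
    # Symbol sizes and colors reflect the bin populations
    binprop <- bincounts / max(bincounts)
    colors <- -log(binprop)/log(2)
    colors <- colors - min(colors)
    colors <- hsv(colors / (max(colors)+1))
    # Raw 0/1 outcomes in gray, reference line, then the binned means
    plot(prediction, actual, col="Gray", cex=0.8)
    abline(0, 1, lty=1, col="Gray")
    points(x, y, pch=19, cex = 3 * sqrt(binprop), col=colors)
    points(x, y, pch=1, cex = 3 * sqrt(binprop))
    invisible(data.frame(x, y, bincounts))
}

calibration.plot(prediction, actual)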

Repeating the analysis produces this plot in which the deviations are clear:

[Figure 2: calibration plot for the poorly calibrated model. The binned points deviate clearly from the diagonal reference line.]

This model tends to be overoptimistic (the average outcomes for predictions in the 50% to 90% range are too low). In the few cases where the prediction is low (less than 30%), the model is too pessimistic.
