Data-Visualization – How to Visualize the Calibration of Predicted Probability of a Model

binary datacalibrationclassificationdata visualizationpredictive-models

Suppose I have a predictive model that produces, for each instance, a probability for each class. Now I recognize that there are many ways to evaluate such a model if I want to use those probabilities for classification (precision, recall, etc.). I also recognize that an ROC curve and the area under it can be used to determine how well the model differentiates between classes. Those are not what I'm asking about.

I'm interested in assessing the calibration of the model. I know that a scoring rule like the Brier score can be useful for this task. That's OK, and I'll likely incorporate something along those lines, but I'm not sure how intuitive such metrics will be for the lay person. I'm looking for something more visual. I want the person interpreting results to be able to see whether or not when the model predicts something is 70% likely to happen that it actually happens ~70% of the time, etc.

I heard of (but never used) Q-Q plots, and at first I thought this was what I was looking for. However, it seems that is really meant for comparing two probability distributions. That's not directly what I have. I have, for a bunch of instances, my predicted probability and then whether the event actually occurred:

Index    P(Heads)    Actual Result
    1          .4            Heads
    2          .3            Tails
    3          .7            Heads
    4         .65            Tails
  ...         ...              ...

So is a Q-Q plot really what I want, or am I looking for something else? If a Q-Q plot is what I should be using, what is the correct way to transform my data into probability distributions?

I imagine I could sort both columns by predicted probability and then create some bins. Is that the type of thing I should be doing, or am I off in my thinking somewhere? I'm familiar with various discretization techniques, but is there a specific way of discretizing into bins that is standard for this sort of thing?

Best Answer

Your thinking is good.

John Tukey recommended binning by halves: split the data into upper and lower halves, then split those halves, then split the extreme halves recursively. Compared to equal-width binning, this allows visual inspection of tail behavior without devoting too many graphical elements to the bulk of the data (in the middle).

Here is an example (using R) of Tukey's approach. (It's not exactly the same: he implemented mletter a little differently.)

First, let's create some predictions and some outcomes that conform to those predictions:

set.seed(17)
prediction <- rbeta(500, 3/2, 5/2)
actual <- rbinom(length(prediction), 1, prediction)
plot(prediction, actual, col="Gray", cex=0.8)

The plot is not very informative, because all the actual values are, of course, either $0$ (did not occur) or $1$ (did occur). (It appears as the background of gray open circles in the first figure below.) This plot needs smoothing. To do so, we bin the data. Function mletter does the splitting-by-halves. Its first argument r is an array of ranks between 1 and n (the second argument). It returns unique (numeric) identifiers for each bin:

mletter <- function(r,n) {
    lower <-  2 + floor(log(r/(n+1))/log(2))
    upper <- -1 - floor(log((n+1-r)/(n+1))/log(2))
    i <- 2*r > n
    lower[i] <- upper[i]
    lower
}

Using this, we bin both the predictions and the outcomes and average each within each bin. Along the way, we compute bin populations:

classes <- mletter(rank(prediction), length(prediction))
pgroups <- split(prediction, classes)
agroups <- split(actual, classes)
bincounts <- unlist(lapply(pgroups, length)) # Bin populations
x <- unlist(lapply(pgroups, mean))           # Mean predicted values by bin
y <- unlist(lapply(agroups, mean))           # Mean outcome by bin

To symbolize the plot effectively we should make the symbol areas proportional to bin counts. It can be helpful to vary the symbol colors a little, too, whence:

binprop <- bincounts / max(bincounts)
colors <- -log(binprop)/log(2)
colors <- colors - min(colors)
colors <- hsv(colors / (max(colors)+1))

With these in hand, we now enhance the preceding plot:

abline(0,1, lty=1, col="Gray")                           # Reference curve
points(x,y, pch=19, cex = 3 * sqrt(binprop), col=colors) # Solid colored circles
points(x,y, pch=1, cex = 3 * sqrt(binprop))              # Circle outlines

As an example of a poor prediction, let's change the data:

set.seed(17)
prediction <- rbeta(500, 5/2, 1)
actual <- rbinom(length(prediction), 1, 1/2 + 4*(prediction-1/2)^3)

Repeating the analysis produces this plot in which the deviations are clear:

This model tends to be overoptimistic (average outcome for predictions in the 50% to 90% range are too low). In the few cases where the prediction is low (less than 30%), the model is too pessimistic.

Related Solutions

Solved – Visualizing mixed model results

Predicting counts using the fixed-effects part of your model means that you set to zero (i.e. their mean) the random effects. This means that you can "forget" about them and use standard machinery to calculate the predictions and the standard errors of the predictions (with which you can compute the confidence intervals).

This is an example using Stata, but I suppose it can be easily "translated" into R language:

webuse epilepsy, clear
xtmepoisson seizures treat visit || subject: visit
predict log_seiz, xb
gen pred_seiz = exp(log_seiz)
predict std_log_seiz, stdp
gen ub = exp(log_seiz+invnorm(.975)*std_log_seiz)
gen lb = exp(log_seiz-invnorm(.975)*std_log_seiz)

tw (line pred_seiz ub lb visit if treat == 0, sort lc(black black black) ///
 lp(l - -)), scheme(s1mono) legend(off) ytitle("Predicted Seizures") ///
 xtitle("Visit")

The graph refers to treat == 0 and it's intended to be an example (visit is not a really continuous variable, but it's just to get the idea). The dashed lines are 95% confidence intervals.

enter image description here

Classification – How to Measure Performance for Calibration in Binary Classification Problems

I ended up finding several measures in the literature (see e.g. CAL and MXE in the the paper Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria by Caruana and Niculescu Mizil).

The most useful measure appears to be the Mean Calibration Error (CAL), which is the weighted root-mean squared error (RMSE) between predicted probabilities and true probabilities on a calibration plot. Formally:

$$\text{CAL} = \frac{1}{N} \sqrt{\sum_{k=1}^K \sum_{i \in B_k} {(p_k - \hat{p}_i})^2}$$

where:

$\hat{p}_i$ is the predicted probability for example $i = 1,\ldots,N$
$p_k$ is the observed probability for examples in bin $k$

Here, the binning is required because we do not typically have a "true" probability for each example $p_i$, only a label $y_i$. Thus, we construct $K$ bins (e.g., $B_1 = [0,0.1)$, $B_2 = [0.1,0.2)$...), and then estimate the observed probabilities for each bin as:

$$\hat{p}_k = \frac{1}{|B_k|}\sum_{i\in B_k} 1[y_i=1]$$

CAL is an intuitive summary statistic, but it does have several shortcomings. In particular:

Since CAL is weighted by the number of observations, CAL can fudge local calibration issues over the full reliability diagram. If, for instance, 95% of your observations could fall into the first bin $\hat{p}_i \in [0,0.05)$ where you predict well... However, you could be completely off in the remaining cases.
CAL depends on the binning procedure. This is why some people use a smoothed estimate (e.g., Caruana and Niculescu Mizil). This is not true in settings where classifiers output a discrete set of predictions (e.g., for risk scores)

Best Answer

Related Solutions

Solved – Visualizing mixed model results

Classification – How to Measure Performance for Calibration in Binary Classification Problems

Related Question