Suppose I have a predictive model that produces, for each instance, a probability for each class. Now I recognize that there are many ways to evaluate such a model if I want to use those probabilities for classification (precision, recall, etc.). I also recognize that an ROC curve and the area under it can be used to determine how well the model differentiates between classes. Those are not what I'm asking about.
I'm interested in assessing the calibration of the model. I know that a scoring rule like the Brier score can be useful for this task. That's OK, and I'll likely incorporate something along those lines, but I'm not sure how intuitive such metrics will be for the lay person. I'm looking for something more visual. I want the person interpreting the results to be able to see whether, when the model predicts something is 70% likely to happen, it actually happens ~70% of the time, and so on.
I've heard of (but never used) Q-Q plots, and at first I thought this was what I was looking for. However, it seems they are really meant for comparing two probability distributions. That's not directly what I have. I have, for a bunch of instances, my predicted probability and then whether the event actually occurred:
| Index | P(Heads) | Actual Result |
|-------|----------|---------------|
| 1     | .4       | Heads         |
| 2     | .3       | Tails         |
| 3     | .7       | Heads         |
| 4     | .65      | Tails         |
| ...   | ...      | ...           |
So is a Q-Q plot really what I want, or am I looking for something else? If a Q-Q plot is what I should be using, what is the correct way to transform my data into probability distributions?
I imagine I could sort both columns by predicted probability and then create some bins. Is that the type of thing I should be doing, or am I off in my thinking somewhere? I'm familiar with various discretization techniques, but is there a specific way of discretizing into bins that is standard for this sort of thing?
Best Answer
Your thinking is good.
John Tukey recommended binning by halves: split the data into upper and lower halves, then split those halves, then split the extreme halves recursively. Compared to equal-width binning, this allows visual inspection of tail behavior without devoting too many graphical elements to the bulk of the data (in the middle).
Here is an example (using R) of Tukey's approach. (It's not exactly the same: he implemented `mletter` a little differently.) First, let's create some predictions and some outcomes that conform to those predictions:
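The original R code is not preserved in this excerpt; here is a minimal Python sketch of the same idea. The array names and the Beta(2, 2) choice for the prediction distribution are my own assumptions, not the answer's.

```python
import numpy as np

rng = np.random.default_rng(17)
n = 500

# Draw predicted probabilities, then draw outcomes that conform to them:
# each outcome occurs with exactly its predicted probability, so the
# simulated model is perfectly calibrated by construction.
predictions = rng.beta(2, 2, size=n)                # probabilities in (0, 1)
actual = (rng.random(n) < predictions).astype(int)  # 1 = occurred, 0 = did not
```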
The plot is not very informative, because all the `actual` values are, of course, either $0$ (did not occur) or $1$ (did occur). (It appears as the background of gray open circles in the first figure below.) This plot needs smoothing. To do so, we bin the data. Function `mletter` does the splitting-by-halves. Its first argument `r` is an array of ranks between 1 and `n` (the second argument). It returns unique (numeric) identifiers for each bin.

Using this, we bin both the predictions and the outcomes and average each within each bin. Along the way, we compute bin populations:
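The answer's R implementation of `mletter` is not shown in this excerpt; one way to realize the same splitting-by-halves in Python is via a log-2 transform of the rank's distance from the nearer tail, as sketched below. The simulated data stands in for the predictions and outcomes created earlier.

```python
import numpy as np

def mletter(r, n):
    """Assign ranks 1..n to bins whose widths halve toward each tail
    (Tukey letter-value style), via floor(log2) of the tail distance."""
    r = np.asarray(r, dtype=float)
    lower = 2 + np.floor(np.log2(r / (n + 1)))
    upper = -1 - np.floor(np.log2((n + 1 - r) / (n + 1)))
    return np.where(2 * r > n, upper, lower).astype(int)

# Simulated, perfectly calibrated data (stands in for the earlier step):
rng = np.random.default_rng(17)
n = 500
predictions = rng.beta(2, 2, size=n)
actual = (rng.random(n) < predictions).astype(int)

# Rank the predictions, assign bins, then average within each bin,
# recording the bin populations along the way.
ranks = predictions.argsort().argsort() + 1   # ranks 1..n
bins = mletter(ranks, n)
uniq = np.unique(bins)
mean_pred = np.array([predictions[bins == b].mean() for b in uniq])
mean_act  = np.array([actual[bins == b].mean() for b in uniq])
counts    = np.array([(bins == b).sum() for b in uniq])
```

For n = 8, `mletter` yields bin ids `-2, -1, 0, 0, 1, 1, 2, 3`: the two middle quarters form the largest bins and the bins shrink by halves toward the extremes.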
To symbolize the plot effectively we should make the symbol areas proportional to bin counts. It can be helpful to vary the symbol colors a little, too, whence:
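A Python sketch of that symbolization (the counts here are hypothetical placeholders; the scale factor 300 is an arbitrary choice):

```python
import numpy as np

counts = np.array([3, 6, 14, 110, 112, 15, 5, 2])  # hypothetical bin counts

# matplotlib's scatter() interprets its `s` argument as an area in
# points^2, so passing values proportional to the counts makes the
# symbol AREAS (not radii) proportional to bin population.
sizes = 300 * counts / counts.max()

# Shade symbols slightly by bin position so neighbors stay distinguishable.
shades = np.linspace(0.25, 0.75, len(counts))
```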
With these in hand, we now enhance the preceding plot:
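Since the original R plotting code isn't reproduced here, the following Python/matplotlib sketch assembles the pieces; all the binned summaries below are hypothetical placeholders.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                        # render off-screen
import matplotlib.pyplot as plt

# Hypothetical binned summaries: mean prediction, observed frequency, count.
mean_pred = np.array([0.12, 0.27, 0.41, 0.55, 0.70, 0.86])
mean_act  = np.array([0.10, 0.28, 0.39, 0.57, 0.68, 0.88])
counts    = np.array([4, 9, 120, 118, 8, 3])

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot([0, 1], [0, 1], "--", color="gray")  # perfect-calibration diagonal
ax.scatter(mean_pred, mean_act,
           s=300 * counts / counts.max(),    # symbol area ~ bin count
           c=np.linspace(0.25, 0.75, len(counts)),
           cmap="viridis", edgecolor="black")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Observed frequency")
fig.savefig("calibration.png")
```

Points near the diagonal indicate good calibration in that range of predictions; points below it mark overoptimism, points above it pessimism.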
As an example of a poor prediction, let's change the data:
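One hypothetical way to produce such miscalibrated data in Python (the shrink-toward-the-middle transform is my own choice, picked so that mid-to-high predictions come out overoptimistic and low ones pessimistic):

```python
import numpy as np

rng = np.random.default_rng(17)
n = 500
predictions = rng.beta(2, 2, size=n)

# The true event probability is flatter than the stated one: 0.3 + 0.4*p.
# Above p = 0.5 the model overstates the chance (overoptimistic);
# below it, the model understates the chance (pessimistic).
actual = (rng.random(n) < 0.3 + 0.4 * predictions).astype(int)
```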
Repeating the analysis produces this plot in which the deviations are clear:
This model tends to be overoptimistic (the average outcomes for predictions in the 50% to 90% range are too low). In the few cases where the prediction is low (less than 30%), the model is too pessimistic.