Consider a binary classification problem, where the goal is to use training data $(x_i,y_i)_{i=1}^n$ to fit a classifier $f: \mathbb{R}^d \rightarrow [0,1]$ that outputs a conditional probability estimate (e.g. $f$ could be a logistic regression model).
The standard way to check whether the predicted probabilities match the true probabilities (i.e., are "well-calibrated") seems to be a reliability plot, which plots the predicted probabilities on the x-axis against the observed frequencies on the y-axis.
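For concreteness, the coordinates of such a plot can be computed by binning the predictions; here is a minimal sketch (the function name, equal-width binning, and default bin count are my own assumptions):

```python
import numpy as np

def reliability_curve(y_true, p_pred, n_bins=10):
    """Coordinates for a reliability plot: mean predicted probability (x)
    and observed positive rate (y) within each equal-width bin."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    bin_ids = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    xs, ys = [], []
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            xs.append(float(p_pred[mask].mean()))
            ys.append(float(y_true[mask].mean()))
    return xs, ys
```

A perfectly calibrated model would put every point on the diagonal $y = x$.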
I am looking for a performance metric that could be used instead of the reliability plot. Ideally, I'd like a metric that is already used in the statistics or ML literature.
Best Answer
I ended up finding several measures in the literature (see e.g. CAL and MXE in the paper Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria by Caruana and Niculescu-Mizil).
The most useful measure appears to be the Mean Calibration Error (CAL), which is the root-mean-squared error (RMSE) between the predicted probabilities and the observed frequencies on a reliability plot. Formally:
$$\text{CAL} = \sqrt{\frac{1}{N} \sum_{k=1}^K \sum_{i \in B_k} (p_i - \hat{p}_k)^2}$$
where $N$ is the total number of examples, $p_i$ is the predicted probability for example $i$, and $\hat{p}_k$ is the observed frequency in bin $B_k$. The binning is required because we do not typically have a "true" probability for each example, only a label $y_i$. Thus, we construct $K$ bins (e.g., $B_1 = [0,0.1)$, $B_2 = [0.1,0.2)$, ...), and then estimate the observed frequency in each bin as:
$$\hat{p}_k = \frac{1}{|B_k|}\sum_{i\in B_k} 1[y_i=1]$$
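Putting the two formulas together, the computation can be sketched as follows (equal-width binning and the function name are my own assumptions):

```python
import numpy as np

def cal_error(y_true, p_pred, n_bins=10):
    """Binned calibration error (CAL): the RMSE between each predicted
    probability and the observed positive rate of its bin."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # Assign each prediction to one of K equal-width bins over [0, 1].
    bin_ids = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    sq_err = np.zeros_like(p_pred)
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            p_hat_k = y_true[mask].mean()  # observed frequency in bin k
            sq_err[mask] = (p_pred[mask] - p_hat_k) ** 2
    return float(np.sqrt(sq_err.mean()))
```

Note that each example contributes one squared error, so bins are implicitly weighted by the number of observations they contain.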
CAL is an intuitive summary statistic, but it does have several shortcomings. In particular:
Since CAL is weighted by the number of observations in each bin, it can mask local calibration issues on the reliability diagram. If, for instance, 95% of your observations fall into the first bin $p_i \in [0,0.05)$, where you predict well, CAL will look good even if your predictions are completely off in the remaining cases.
CAL depends on the binning procedure, which is why some people use a smoothed estimate instead (e.g., Caruana and Niculescu-Mizil). This caveat does not apply in settings where the classifier outputs a discrete set of predictions (e.g., risk scores), since each distinct predicted value naturally forms its own bin.
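The binning dependence is easy to demonstrate: the same toy predictions (hypothetical data, equal-width bins) scored with two different bin counts give two different CAL values.

```python
import numpy as np

# Toy example: same labels and predictions, two different bin counts.
y = np.array([0., 0., 1., 1.])
p = np.array([0.2, 0.4, 0.6, 0.8])

errors = {}
for n_bins in (1, 2):
    ids = np.minimum((p * n_bins).astype(int), n_bins - 1)
    # Squared error of each prediction against its bin's observed frequency.
    sq = np.concatenate([(p[ids == k] - y[ids == k].mean()) ** 2
                         for k in range(n_bins) if (ids == k).any()])
    errors[n_bins] = float(np.sqrt(sq.mean()))

print(errors)  # the two bin counts give different CAL values
```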