Classification – How to Measure Performance for Calibration in Binary Classification Problems


Consider a binary classification problem, where the goal is to use training data $(x_i,y_i)_{i=1}^n$ to fit a classifier $f: \mathbb{R}^d \rightarrow [0,1]$ that outputs a conditional probability estimate (e.g. $f$ could be a logistic regression model).

The standard way to check whether the predicted probabilities match the true probabilities (i.e., are “well-calibrated”) seems to be a reliability plot, which plots the predicted probabilities on the x-axis and the observed frequencies of positives on the y-axis.
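For reference, here is a minimal sketch of how such a plot can be produced with scikit-learn's `calibration_curve` (the toy data and logistic model are placeholders for your own $(x_i, y_i)$ and $f$):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data and classifier; stand-ins for (x_i, y_i) and f
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
f = LogisticRegression().fit(X_train, y_train)
p_hat = f.predict_proba(X_test)[:, 1]

# Observed frequency of positives per bin (y-axis) vs. mean predicted probability (x-axis)
obs, pred = calibration_curve(y_test, p_hat, n_bins=10)
plt.plot(pred, obs, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()
```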

Is there a performance metric that could be used instead of the reliability plot? Ideally, I'd like to find a metric that is used in the statistics or ML literature.

Best Answer

I ended up finding several measures in the literature (see, e.g., CAL and MXE in the paper Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria by Caruana and Niculescu-Mizil).

The most useful measure appears to be the calibration error (CAL), which is the root-mean-squared error (RMSE) between the predicted probabilities and the binned observed frequencies on a calibration plot (so each bin is implicitly weighted by its number of observations). Formally:

$$\text{CAL} = \sqrt{\frac{1}{N}\sum_{k=1}^K \sum_{i \in B_k} (\bar{p}_k - \hat{p}_i)^2}$$

where:

  • $\hat{p}_i$ is the predicted probability for example $i = 1,\ldots,N$
  • $\bar{p}_k$ is the observed frequency of positives among examples in bin $B_k$ (defined below)

Here, the binning is required because we typically do not have a "true" probability $p_i$ for each example, only a label $y_i \in \{0,1\}$. Thus, we construct $K$ bins (e.g., $B_1 = [0,0.1)$, $B_2 = [0.1,0.2)$, ...), and estimate the observed frequency for each bin as:

$$\bar{p}_k = \frac{1}{|B_k|}\sum_{i\in B_k} \mathbf{1}[y_i=1]$$
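Here is a minimal sketch of this computation with equal-width bins (NumPy only; the function name `cal_error` and the default bin count are my own choices, not from the paper):

```python
import numpy as np

def cal_error(y_true, p_hat, n_bins=10):
    """Binned calibration error: RMSE between each predicted probability
    and the observed frequency of positives in its bin."""
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat)
    # Assign each prediction to an equal-width bin [0, 1/n_bins), [1/n_bins, 2/n_bins), ...
    bins = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    sq_err = 0.0
    for k in range(n_bins):
        in_bin = bins == k
        if not in_bin.any():
            continue
        p_bar_k = y_true[in_bin].mean()  # observed frequency of positives in bin k
        sq_err += np.sum((p_bar_k - p_hat[in_bin]) ** 2)
    return np.sqrt(sq_err / len(p_hat))
```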

CAL is an intuitive summary statistic, but it does have several shortcomings. In particular:

  • Since CAL weights each bin by its number of observations, it can hide local calibration problems in the reliability diagram. If, for instance, 95% of your observations fall into the first bin $\hat{p}_i \in [0,0.05)$, where you predict well, CAL will look fine even if your predictions are completely off in the remaining bins.

  • CAL depends on the binning procedure, which is why some authors use a smoothed estimate instead (e.g., Caruana and Niculescu-Mizil). This dependence disappears in settings where classifiers output a discrete set of predictions (e.g., risk scores), since each distinct prediction value can serve as its own bin. The snippet below illustrates the binning dependence.
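To see the binning dependence concretely, one can recompute CAL with different bin counts, reusing `cal_error` and the held-out predictions from the sketches above (the exact numbers will depend on your data and model):

```python
# Reusing cal_error, y_test, and p_hat from the sketches above
for n_bins in (5, 10, 20, 50):
    print(f"n_bins={n_bins}: CAL={cal_error(y_test, p_hat, n_bins=n_bins):.4f}")
```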
