Solved – Probability calibration metric for multiclass classifier

calibration, machine learning, multi-class, probability

A machine learning classifier can be calibrated so that, when it predicts that datapoint $i$ belongs to class A with probability 0.6, that prediction is correct 60% of the time.

In the binary setting, this can be visualised with a reliability curve, or measured with a metric like Mean Calibration Error, which is the weighted root-mean-squared error between predicted probabilities and observed frequencies on a calibration plot (see here).
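For the binary case, a reliability curve can be produced with scikit-learn's `calibration_curve`; here is a minimal sketch on synthetic data (the `y_prob`/`y_true` arrays are illustrative assumptions, constructed to be well calibrated):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)                         # predicted P(class 1)
y_true = (rng.uniform(size=1000) < y_prob).astype(int)  # labels drawn to match y_prob

# Bin the predictions and compare the mean predicted probability (x)
# with the observed fraction of positives (y) in each bin.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.column_stack([prob_pred, prob_true]))  # points on the reliability curve
```

For a well-calibrated model, the printed pairs lie close to the diagonal $y = x$.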

My question is: how do you extend this to the multiclass setting? Clearly it can't be visualised directly, but is a reliability curve for each class appropriate? Or does this depend on the classifier being used (for example, I'm using an SVM with OVA, i.e. one-vs-all)? Is the Brier score or log-loss the best way to go (the volatility of log-loss puts me off a bit), or is it possible (and how?) to extend Mean Calibration Error to multiclass? Another possibility is CAL, defined here.

Best Answer

Following Guo et al., I ended up using the Expected Calibration Error (ECE), defined as $$\mathrm{ECE} = \sum_{m=1}^{M}\frac{|B_m|}{n}\left|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\right|,$$ where the $n$ predictions are grouped into $M$ equal-width confidence bins $B_m$, $\mathrm{acc}(B_m)$ is the accuracy of the predictions in bin $m$, and $\mathrm{conf}(B_m)$ is their average predicted confidence.

In extending this to the multiclass setting, one can either take the maximum predicted probability for each prediction, or average across the top $k$ predictions, if desired.
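For concreteness, here is a minimal NumPy sketch of ECE under those definitions, using the maximum-probability reduction and equal-width bins; the function name and array layout are assumptions, not a reference implementation:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """probs: (n_samples, n_classes) predicted probabilities; labels: (n_samples,) true class indices."""
    confidences = probs.max(axis=1)       # confidence = top predicted probability
    predictions = probs.argmax(axis=1)    # predicted class
    accuracies = (predictions == labels)  # 1 where the prediction is correct
    n = len(labels)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)  # B_m
        if in_bin.any():
            acc = accuracies[in_bin].mean()    # acc(B_m)
            conf = confidences[in_bin].mean()  # conf(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece

# Tiny usage example with made-up probabilities
probs = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2]])
labels = np.array([0, 2])
print(expected_calibration_error(probs, labels))
```

Replacing `probs.max(axis=1)` with an average over the top-$k$ probabilities per row gives the top-$k$ variant mentioned above.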
