Solved – Probability calibration metric for multiclass classifier

calibration, machine learning, multi-class, probability

A machine learning classifier can be calibrated so that, when it predicts that datapoint $i$ belongs to class A with probability 0.6, that prediction is correct 60% of the time.

In the binary setting, this can be visualised with a reliability curve, or measured with a metric like Mean Calibration Error, which is the weighted root-mean-squared error between predicted probabilities and observed frequencies on a calibration plot (see here).
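For the binary case, a reliability curve can be produced with scikit-learn's `calibration_curve`; here is a minimal sketch on synthetic data (the `y_prob`/`y_true` arrays are illustrative assumptions, constructed to be well calibrated):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)                         # predicted P(class 1)
y_true = (rng.uniform(size=1000) < y_prob).astype(int)  # labels drawn to match y_prob

# Bin the predictions and compare the mean predicted probability (x)
# with the observed fraction of positives (y) in each bin.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.column_stack([prob_pred, prob_true]))  # points on the reliability curve
```

For a well-calibrated model, the printed pairs lie close to the diagonal $y = x$.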

My question is: how do you extend this to the multiclass setting? Clearly it can't be visualised directly, but is a reliability curve for each class appropriate? Or does this depend on the classifier being used (for example, I'm using an SVM with OVA, i.e. one-vs-all)? Is the Brier score or log-loss the best way to go (the volatility of log-loss puts me off a bit), or is it possible (and how?) to extend Mean Calibration Error to multiclass? Another possibility is CAL, defined here.

Best Answer

Following Guo et al., I ended up using the Expected Calibration Error (ECE), defined as $$\mathrm{ECE} = \sum_{m=1}^{M}\frac{|B_m|}{n}\left|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\right|,$$ where the $n$ predictions are grouped into $M$ equal-width confidence bins $B_m$, $\mathrm{acc}(B_m)$ is the accuracy of the predictions in bin $m$, and $\mathrm{conf}(B_m)$ is their average predicted confidence.

In extending this to the multiclass setting, one can either take the maximum predicted probability for each prediction, or average across the top $k$ predictions, if desired.
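For concreteness, here is a minimal NumPy sketch of ECE under those definitions, using the maximum-probability reduction and equal-width bins; the function name and array layout are assumptions, not a reference implementation:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """probs: (n_samples, n_classes) predicted probabilities; labels: (n_samples,) true class indices."""
    confidences = probs.max(axis=1)       # confidence = top predicted probability
    predictions = probs.argmax(axis=1)    # predicted class
    accuracies = (predictions == labels)  # 1 where the prediction is correct
    n = len(labels)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)  # B_m
        if in_bin.any():
            acc = accuracies[in_bin].mean()    # acc(B_m)
            conf = confidences[in_bin].mean()  # conf(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece

# Tiny usage example with made-up probabilities
probs = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2]])
labels = np.array([0, 2])
print(expected_calibration_error(probs, labels))
```

Replacing `probs.max(axis=1)` with an average over the top-$k$ probabilities per row gives the top-$k$ variant mentioned above.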
