Solved – Probability calibration from LightGBM model with class imbalance

boosting, calibration, propensity-scores, python, scikit-learn

I've built a binary classification model using LightGBM. The dataset is fairly imbalanced, but I'm happy enough with the model's output. What I'm unsure about is how to properly calibrate the output probabilities. The baseline score from sklearn.dummy.DummyClassifier is:

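from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Note: the report below matches strategy="stratified", the default in older scikit-learn versions
dummy = DummyClassifier(random_state=54)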

dummy.fit(x_train, y_train)

dummy_pred = dummy.predict(x_test)

dummy_prob = dummy.predict_proba(x_test)
dummy_prob = dummy_prob[:,1]

print(classification_report(y_test, dummy_pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98    132274
           1       0.02      0.02      0.02      2686

   micro avg       0.96      0.96      0.96    134960
   macro avg       0.50      0.50      0.50    134960
weighted avg       0.96      0.96      0.96    134960

The output of the model is below, and I am OK with the results:

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97    132274
           1       0.27      0.96      0.42      2686

   micro avg       0.95      0.95      0.95    134960
   macro avg       0.63      0.95      0.70    134960
weighted avg       0.98      0.95      0.96    134960

I want to use the output probabilities, so I thought I should check how well the model is calibrated, since tree-based models are often poorly calibrated. I used sklearn.calibration.calibration_curve to plot the curve:

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

gb_y, gb_x = calibration_curve(y_test, rf_probs, n_bins=10)

plt.plot([0, 1], [0, 1], linestyle='--')
# plot model reliability
plt.plot(gb_x, gb_y, marker='.')
plt.show()

I tried Platt scaling, i.e. fitting a logistic regression to the model's output probabilities on the validation set and applying it to the test data. While the result is better calibrated, the probabilities are restricted to a maximum of approximately 0.4. I would like the output to cover a good range, i.e. people with both low and high probabilities.
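As a point of reference, the Platt-scaling step described above could look roughly like the sketch below. It is not the code used for the model in this post: the simulated scores, the names `val_scores`, `test_scores` and `y_test_sim`, and the choice to fit the logistic regression on the logit of the scores rather than on the raw scores are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(n):
    # Hypothetical stand-in for model output: about 2% positives, with
    # positives receiving higher raw scores on average.
    y = (rng.random(n) < 0.02).astype(int)
    raw = rng.normal(loc=np.where(y == 1, 1.5, -1.5), scale=1.0)
    scores = 1.0 / (1.0 + np.exp(-raw))  # uncalibrated "probabilities"
    return scores, y

val_scores, y_val = simulate(20000)
test_scores, y_test_sim = simulate(20000)

def to_logit(p, eps=1e-6):
    # Map probabilities back to the log-odds scale, avoiding log(0)
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

# Platt scaling: a logistic regression with the model score as its only feature,
# fitted on the validation set and applied to the test set.
platt = LogisticRegression(C=1e6)  # large C, so effectively unregularised
platt.fit(to_logit(val_scores).reshape(-1, 1), y_val)
calibrated = platt.predict_proba(to_logit(test_scores).reshape(-1, 1))[:, 1]

print("mean raw score:        ", test_scores.mean())
print("mean calibrated score: ", calibrated.mean())
print("observed positive rate:", y_test_sim.mean())
print("max calibrated score:  ", calibrated.max())
```

After calibration the mean predicted probability should sit close to the observed positive rate, which is why calibrated probabilities in a heavily imbalanced problem tend to look small.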

Does anybody know how I would go about this?

Best Answer

I would suggest not changing the (calibrated) predicted probabilities. Some further points:

While calibrated probabilities appearing "low" might be counter-intuitive, it might also be more realistic given the nature of the problem. Especially when operating in an imbalanced setting, predicting that a particular user/person has a very high absolute probability of being in the very rare positive class might be misleading/over-confident.
I am not 100% clear from your post how the calibration was done. Suppose we ran repeated cross-validation, $2$ repeats of $5$-fold: within each of the $10$ executions we should use a separate, say, $K$-fold internal cross-validation, with $K-1$ folds for learning the model and $1$ fold for fitting the calibration map. $K$ calibrated classifiers are then generated within each execution, and their outputs are averaged to provide predictions on the test fold. (Platt's original paper, Probabilities for SV Machines, uses $K=3$ throughout, but that is not a hard rule.)
Given that we are calibrating the probabilities of our classifier, it would make sense to also use proper scoring rules such as the Brier score, the Continuous Ranked Probability Score (CRPS) and the logarithmic score (the latter assuming no probabilities of exactly $0$ or $1$ are predicted).
After we have decided on the threshold $T$ for our probabilistic classifier, we are in a good position to explain what it does. Indeed, the risk classification might suggest to "treat any person with risk higher than $0.03$"; that is fine if we can relate it to the relevant misclassification costs. Similarly, if misclassification costs are unavailable but we use a proper scoring rule like the Brier score, we are still fine: we have calibrated probabilistic predictions anyway.
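The internal $K$-fold calibration and the proper scoring rules mentioned in the points above can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the asker's pipeline: the data are synthetic, `GradientBoostingClassifier` stands in for the LightGBM model, and `cv=3` with `method="sigmoid"` is just one reasonable setting. With `cv=3`, `CalibratedClassifierCV` fits the model on two folds, fits the calibration map on the third, and averages the three resulting calibrated classifiers at prediction time.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 5% positives); purely illustrative
X, y = make_classification(n_samples=6000, n_features=10, n_informative=5,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Stand-in for the boosted model from the question
base = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Internal 3-fold CV: model on 2 folds, calibration map on the remaining fold
calibrated_clf = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated_clf.fit(X_tr, y_tr)

# Same model without calibration, for comparison
uncalibrated_clf = GradientBoostingClassifier(
    n_estimators=50, random_state=0).fit(X_tr, y_tr)

p_raw = uncalibrated_clf.predict_proba(X_te)[:, 1]
p_cal = calibrated_clf.predict_proba(X_te)[:, 1]

for name, p in [("raw", p_raw), ("calibrated", p_cal)]:
    # Clip so the logarithmic score stays finite if a 0 or 1 is predicted
    p_safe = np.clip(p, 1e-15, 1 - 1e-15)
    print(f"{name:>10}  Brier={brier_score_loss(y_te, p):.4f}  "
          f"log loss={log_loss(y_te, p_safe):.4f}")
```

Lower is better for both scores. Whether calibration improves them depends on the data and the model, so the comparison should be made on held-out data rather than assumed.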

Solved – Multi-class SVM Calibration

Proposed Solution: Calibration

I have read the "Multi-class" part of your post, but since you are using one-vs-all SVMs, I think you should reconsider solving the problem at the binary level. You could calibrate the individual SVMs so that the resulting output values are comparable.

Calibration methods for binary SVMs (and therefore also applicable in the one-vs-all scenario) are Platt scaling¹ and isotonic regression. A nice overview with Python code is available here.

For your own use case you would then calibrate each OvA SVM separately and afterwards the calibrated outputs for a, b and c should be comparable.
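One way this per-classifier calibration could be set up in scikit-learn is sketched below. The synthetic three-class data (labels 0, 1, 2 standing in for a, b and c), the use of `LinearSVC`, and the settings `method="sigmoid"` and `cv=3` are assumptions for illustration only.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 3-class problem; labels 0, 1, 2 play the role of a, b, c
X, y = make_classification(n_samples=1500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_new = X[1000:]

# Each binary one-vs-all SVM gets its own Platt (sigmoid) calibration map
binary_svm = CalibratedClassifierCV(LinearSVC(max_iter=5000),
                                    method="sigmoid", cv=3)
ova = OneVsRestClassifier(binary_svm)
ova.fit(X_train, y_train)

# Calibrated per-class outputs; OneVsRestClassifier rescales each row to sum to 1
proba = ova.predict_proba(X_new)
print(proba[:3].round(3))
```

Because every binary SVM is mapped onto a probability scale separately, the three columns can be compared with each other, which the raw decision values do not guarantee.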

What does calibration do here?

The key point is that SVMs themselves are not probabilistic. The output value you mentioned is usually a function of the classified point's distance to the hyperplane, so we are using a heuristic that has no further significance. The goal of this heuristic is only that higher numbers are more likely to be the correct result.

You can measure the significance of your output values using a reliability plot. I will cut it short here, but essentially you want your reliability curve to be as close as possible to the diagonal. Calibration adds another mapping, from output values to calibrated output values. This can handle, for example, classifiers that are biased towards high output values. Think of it as another translation step: "OK, I got that really confident 0.9 from you, classifier A, but I know you are always over-confident, so let's make this a 0.5". So a 0.5 value from classifier A should be closer to a 0.5 value from classifier B in the end.

Keep in mind that, when using calibration, you have to work as carefully as usual (separate train/dev/test sets).
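The "translation step" can be made concrete with a small simulation. Everything in this sketch is assumed for illustration: the over-confident classifier is simulated by exaggerating the true log-odds by a factor of 3, isotonic regression is used as the calibration map, and `reliability_gap` is a hypothetical helper that measures the average distance of the reliability curve from the diagonal.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
n = 20000
true_p = rng.uniform(0.01, 0.99, size=n)   # true probability of the positive class
y = (rng.random(n) < true_p).astype(int)

# Hypothetical over-confident classifier: pushes scores towards 0 and 1
logit = np.log(true_p / (1 - true_p))
score = 1.0 / (1.0 + np.exp(-3.0 * logit))

# Learn the mapping score -> calibrated score on one half, apply it to the other
half = n // 2
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(score[:half], y[:half])
remapped = iso.predict(score[half:])

def reliability_gap(y_true, prob, n_bins=10):
    # Mean absolute distance between the reliability curve and the diagonal
    frac_pos, mean_pred = calibration_curve(y_true, prob, n_bins=n_bins)
    return np.mean(np.abs(frac_pos - mean_pred))

gap_before = reliability_gap(y[half:], score[half:])
gap_after = reliability_gap(y[half:], remapped)
print("gap before calibration:", round(gap_before, 3))
print("gap after calibration: ", round(gap_after, 3))
print("a raw 0.9 is mapped to:", iso.predict([0.9])[0].round(3))
```

The mapping is fitted on one half of the data and evaluated on the other, in line with the usual train/dev/test discipline.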

1. Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3), 61-74.

Solved – Probability Calibration messes Reliability

The short answer is that your reliability graph is not actually worse. It just looks worse because the bins in the center of the distribution contain very few points, so the (empirical) probability in those bins jumps around.
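The effect of sparsely populated bins can be checked without any plotting library. The sketch below simulates predictions that are perfectly calibrated by construction but concentrated near 0 and 1 (the Beta(0.1, 0.1) distribution and the sample size are assumptions), then reports how many points fall into each of ten equal-width bins and the approximate binomial standard error of the observed rate in each bin.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
# Hypothetical predictions, calibrated by construction, piled up near 0 and 1
probs = rng.beta(0.1, 0.1, size=n)
y = (rng.random(n) < probs).astype(int)

edges = np.linspace(0.0, 1.0, 11)
bin_idx = np.clip(np.digitize(probs, edges) - 1, 0, 9)

counts = np.bincount(bin_idx, minlength=10)
positives = np.bincount(bin_idx, weights=y, minlength=10)
safe_counts = np.maximum(counts, 1)  # guard against empty bins

mean_prob = np.bincount(bin_idx, weights=probs, minlength=10) / safe_counts
observed = positives / safe_counts
# Approximate standard error of the observed rate in each bin
std_err = np.sqrt(mean_prob * (1.0 - mean_prob) / safe_counts)

for b in range(10):
    print(f"[{edges[b]:.1f}, {edges[b + 1]:.1f})  n={counts[b]:4d}  "
          f"predicted={mean_prob[b]:.3f}  observed={observed[b]:.3f}  "
          f"std err={std_err[b]:.3f}")
```

The central bins hold far fewer points than the outer ones, so their observed rates carry a much larger standard error even though the underlying probabilities are calibrated.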

I suggest using the ml_insights package for calibration. (Disclaimer: I am a primary author of the package). It has a function plot_reliability_diagram which lets you easily control the bins used, as well as a (default) option displaying the points sized by the number of points in the bin.

So, in your terminal, run pip install ml_insights

Then try the following code at the end of the code you posted above.

import ml_insights as mli

mli.plot_reliability_diagram(y_test, probs_sigmoid[:,1], c='blue')

But if we use fewer bins, we see a clearer picture:

mli.plot_reliability_diagram(y_test, probs_sigmoid[:,1], bins=[0,.1,.2,.4,.6,.8,.9,1], c='blue')

While you are at it, you can also play around with the Spline Calibration function we developed.

    spline_calib = mli.SplineCalibratedClassifierCV(rfc)
    spline_calib.fit(X_train,y_train)
    spline_probs = spline_calib.predict_proba(X_test)

It typically gives better results than either the sigmoid or isotonic methods of calibration.

Please let me know if this helps. We are actively working on developing and improving this package.
