Machine Learning – Why a Well-Calibrated Model Has Worse Brier Score Loss

calibration, classification, logistic, machine-learning, mathematical-statistics

I have already referred to this post, so please don't mark this as a duplicate.

I am working on a binary classification problem using algorithms such as random forest, extra trees, and logistic regression. The dataset shape is (977, 6) and the class ratio is 77:23.

In terms of our metric of interest, F1, random forest seemed to do best, followed by extra trees, with logistic regression last.

However, in terms of calibration, I see that logistic regression is well calibrated (not surprising), followed by extra trees, with random forest last.

But my question is: why does logistic regression have a higher Brier score loss than random forest (which doesn't have the inherent calibration that logistic regression has)?

Shouldn't logistic regression have the smallest Brier score loss, followed by extra trees, with random forest last?

Please find the graphs below.

[Calibration curve and Brier score plots for random forest, extra trees, and logistic regression]
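For concreteness, here is a minimal sketch of how such a comparison is often set up in scikit-learn. The synthetic data from `make_classification`, the split, and the default model settings are assumptions standing in for the actual pipeline, not the code that produced the plots above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, brier_score_loss
from sklearn.calibration import calibration_curve

# Stand-in for a (977, 6) dataset with a roughly 77:23 class ratio.
X, y = make_classification(n_samples=977, n_features=6,
                           weights=[0.77, 0.23], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "random forest": RandomForestClassifier(random_state=0),
    "extra trees": ExtraTreesClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]       # probability of the positive class
    f1 = f1_score(y_test, model.predict(X_test))    # metric of interest
    brier = brier_score_loss(y_test, proba)         # Brier score loss
    # Points for the reliability (calibration) diagram.
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    print(f"{name}: F1 = {f1:.3f}, Brier = {brier:.3f}")
```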

Best Answer

Brier score can be decomposed into measures of calibration and discrimination. Calibration describes the extent to which predicted probabilities align with true event occurrence. That is, if an event that is predicted to happen with probability $0.5$ actually happens $90\%$ of the time, the calibration is poor. Discrimination describes the extent to which model predictions for the two categories can be separated, and the Brier score does well here when the predicted distributions for the two categories are easy to separate (hence the relationship to the ROC AUC discussed in the link).
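The Brier score itself is just the mean squared error of the predicted probabilities, $\frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2$, so lower is better. As a rough illustration of the decomposition mentioned above, here is a binned (Murphy-style) reliability/resolution/uncertainty split; the bin count and uniform bin edges are choices of this sketch, not something fixed by the decomposition:

```python
import numpy as np

def brier_decomposition(y_true, y_prob, n_bins=10):
    """Murphy decomposition: Brier ≈ reliability - resolution + uncertainty.

    Binned approximation; the number of bins is an assumption of this sketch.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)

    base_rate = y_true.mean()      # overall event frequency
    reliability = 0.0              # calibration term (lower is better)
    resolution = 0.0               # discrimination term (higher is better)
    n = len(y_true)
    for k in range(n_bins):
        mask = bin_ids == k
        if not mask.any():
            continue
        n_k = mask.sum()
        f_k = y_prob[mask].mean()  # mean forecast within the bin
        o_k = y_true[mask].mean()  # observed frequency within the bin
        reliability += n_k * (f_k - o_k) ** 2 / n
        resolution += n_k * (o_k - base_rate) ** 2 / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty
```

A well-calibrated but weakly discriminating model will show a small reliability term but also a small resolution term, so its total Brier score stays close to the uncertainty implied by the base rate.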

You have a poor Brier score despite good calibration. This must mean that the model's ability to discriminate between the two categories is poor.
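A small, purely illustrative simulation of that point (all numbers here are made up): a predictor that always outputs the base rate is perfectly calibrated but cannot discriminate at all, and it ends up with a worse Brier score than a sharper, somewhat over-confident predictor.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.23).astype(int)      # roughly 77:23 class ratio, as in the question

# Perfectly calibrated but non-discriminating: always predict the base rate.
p_base = np.full_like(y, 0.23, dtype=float)

# Hypothetical sharp, over-confident model: pushes predictions toward 0/1,
# and lands on the wrong side for about 10% of cases.
p_sharp = np.where(y == 1, 0.95, 0.05)
flip = rng.random(len(y)) < 0.10
p_sharp = np.where(flip, 1.0 - p_sharp, p_sharp)

print("calibrated, no discrimination:", brier_score_loss(y, p_base))    # roughly 0.18
print("over-confident, discriminating:", brier_score_loss(y, p_sharp))  # roughly 0.09, i.e. better
```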