Machine Learning – Is Brier Score Appropriate for Comparing Different Classification Models?

Tags: accuracy, logistic, machine-learning, scoring-rules

TL;DR: I am working on a binary classification problem and have several models whose out-of-the-box performance I want to compare. I read that accuracy is a poor metric and that the Brier score or log loss should be used instead. However, I also read that the Brier score should not be used to compare, say, logistic regression against a random forest, and that it is mainly meant as a metric when tuning the parameters of a single model. Is that statement true? Is it wrong to use the Brier score to compare the performance of different models/approaches?


Full background to my research question:

Hi all,

I have a dataset composed of two groups (disease type 1 vs. type 2), with 50 samples per group. For each sample, around 7000 features are measured. Importantly, identifying type 2 is key, and I am willing to "pay the price" of getting some type 1 cases as false positives.

My initial plan was to run feature selection and machine learning to classify these groups. After reading a bunch of posts here, I realize that my approach may not be ideal for my dataset. For instance, ML with 100 samples is far from ideal. In addition, my dataset is split 50/50 while the real-world prevalence of the two disease types is 70/30, so any model I build will most likely underperform in deployment (one standard probability adjustment for this prevalence shift is sketched below).
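For what it's worth, one common fix for this particular limitation (not part of the original question) is a prior-shift correction on the odds scale, which rescales predicted probabilities from the training prevalence to an assumed deployment prevalence. A minimal sketch, assuming type 2 is the positive class at 30% real-world prevalence (the question does not say which class is the 30% one):

```python
import numpy as np

def adjust_for_prevalence(p, train_prev=0.5, deploy_prev=0.3):
    """Prior-shift correction: rescale predicted probabilities of the
    positive class from the training prevalence (balanced 50/50 here)
    to an assumed deployment prevalence (type 2 taken as the 30% class)."""
    p = np.asarray(p, dtype=float)
    odds = p / (1 - p)
    # Multiply the odds by the ratio of deployment odds to training odds.
    odds *= (deploy_prev / train_prev) / ((1 - deploy_prev) / (1 - train_prev))
    return odds / (1 + odds)

# A predicted 0.60 under the balanced training data shrinks to ~0.39
# once the assumed 30% deployment prevalence is taken into account.
print(adjust_for_prevalence(0.60))
```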

I am aware of these limitations (and there are probably many more), but since the data is already in my hands, I would like to "play" with it and see what I can get. I plan to run repeated k-fold cross-validation (10-fold with 10 repetitions). Inside each fold, I perform mRMR feature selection and then fit a few classification models: logistic regression, random forest, SVM, XGBoost, and a few more. I want to compare the performance of each model and then spend more time optimizing the one that performs best out of the box.
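A minimal sketch of that protocol with scikit-learn, scoring each model by Brier score. `SelectKBest` stands in for mRMR (which scikit-learn does not ship), and synthetic data substitutes for the real dataset; the key point is that the `Pipeline` refits the feature selection inside every CV fold:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in data with roughly the question's shape: 100 samples, 7000 features.
X, y = make_classification(n_samples=100, n_features=7000,
                           n_informative=20, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

for name, clf in models.items():
    # Feature selection lives inside the pipeline, so it is refit per fold.
    pipe = Pipeline([("select", SelectKBest(f_classif, k=50)), ("clf", clf)])
    # "neg_brier_score" follows sklearn's higher-is-better convention;
    # negate it to report the usual lower-is-better Brier score.
    scores = -cross_val_score(pipe, X, y, cv=cv, scoring="neg_brier_score")
    print(f"{name}: Brier = {scores.mean():.3f} (sd {scores.std():.3f})")
```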

At first, I was going to compare the logistic regression and the ML models using accuracy, but great posts by Frank Harrell, Stephan Kolassa, and others changed my mind. Right now, I plan to use the Brier score, at least in this initial stage where an overall screening is needed. However, I read that the Brier score should not be used to compare logistic regression against a random forest because they are two different kinds of model; it seemed the Brier score should only be used for the same model under different parameters, for example when evaluating the gains from hyperparameter tuning. How much of that is actually true?

Best Answer

The Brier score might not be the statistic of interest for a particular task. In that case, it would not be appropriate for comparing a logistic regression and a random forest, but only because the Brier score is simply not the right quantity to calculate, not because of anything specific to how the probabilities it evaluates were calculated or estimated.

However, if the Brier score is what interests you, do calculate it. As long as the inputs to the score are appropriate, meaning they have an interpretation as probabilities (so neither the log-odds output of a logistic regression nor the predicted category returned by the prediction methods in random forest software), go for it.
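To make that input requirement concrete, here is a minimal sketch on synthetic data (not part of the original argument) contrasting the score computed from probabilities with the score computed from hard labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

proba = rf.predict_proba(X_te)[:, 1]  # class probabilities: the right input
labels = rf.predict(X_te)             # hard labels: degenerate "probabilities"

print("Brier on probabilities:", brier_score_loss(y_te, proba))
# Feeding hard 0/1 labels collapses the score to the misclassification rate,
# which defeats the point of using a proper scoring rule.
print("Brier on hard labels:  ", brier_score_loss(y_te, labels))
```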

If the objection to doing this is that random forests often give poorly calibrated probability values and the Brier score will penalize that, this seems like a feature of the Brier score, not a bug. (Or perhaps you do not care about calibration, but then the Brier score should not be your statistic of interest.)
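If calibration does matter to you, a quick check (a minimal sketch on synthetic data, not part of the original answer) is to wrap the forest in scikit-learn's `CalibratedClassifierCV` and see whether the Brier score improves; isotonic calibration is one common choice:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Plain forest vs. the same forest with its probabilities recalibrated
# via cross-validated isotonic regression on the training set.
raw = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                             method="isotonic", cv=5).fit(X_tr, y_tr)

for name, model in [("raw forest", raw), ("calibrated forest", cal)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: Brier = {brier_score_loss(y_te, p):.4f}")
```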

If the objection to calculating the Brier score of a model is that the model was not optimized well (as mentioned in the comments), that seems like an admission that the model is not very good. If a model makes poor predictions (in terms of Brier score) because it was not optimized well, the key part of that, to me, is that the model is making poor predictions.
