Machine Learning – Using Brier Score for Extreme Class Imbalance

classification, machine-learning, scoring-rules, unbalanced-classes

Since I learned about proper scoring rules for binary classification, such as the Brier score or log loss, I have become increasingly convinced that they are drastically underused in practice in favor of measures like accuracy, ROC AUC, or F1. As I want to drive a shift toward proper scoring rules for model comparison in my organization, there is one common argument that I cannot fully answer:

If there is extreme class imbalance (e.g., 5 positive cases vs. 1,000 negative cases), how does the Brier score ensure that we select the model that performs best at giving high probability forecasts to the 5 positive cases? After all, we do not care whether the negative cases have predictions near 0 or 0.5, as long as those predictions are lower than the ones for the positive cases.

I have two possible answers available right now but would love to hear expert opinions on this topic:

1."The Brier score as a proper scoring rule gives rare events the appropriate weight that they should have on the performance evaluation. Discriminative power can further be examined with ROC AUC."

This follows the logic of Frank Harrell's comment on a related question: "Forecasts of rare events have the "right" effect on the mean, i.e., mean predicted probability of the event = overall proportion of events. The Brier score works no matter what the prevalence of events." As he further suggests there, one could supplement the Brier score with ROC AUC to examine the extent to which the desired relative ranking of positive over negative cases is achieved (both options are illustrated in the numerical sketch below).

2."We can use stratified Brier score to equally weight the forecast performance regarding each class."

This follows the argument of this paper: "Averaging the Brier score of all the classes gives the stratified Brier score. The stratified Brier score is more appropriate when there is class imbalance since it gives equal importance to all the classes and thus allows any miscalibration of the minority classes to be spotted." I am not sure whether giving up the strictly proper scoring rule property is worth the heavier weighting of the minority class of interest, or whether there is a statistically sound foundation for this somewhat arbitrary reweighting ("If we follow this approach, what stops us from going further and weighting the minority class 2, 17, or 100 times as much as the other class?").
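To make both options concrete, here is a minimal sketch of how I would compute them on a made-up example with 5 positives and 1,000 negatives. The helpers `brier_score` and `stratified_brier_score` are my own; the latter simply averages the per-class Brier scores as the quoted description suggests, so the paper's exact definition may differ.

```python
# Toy illustration of the two proposed answers (made-up predictions, binary 0/1 labels).
import numpy as np
from sklearn.metrics import roc_auc_score

def brier_score(y_true, y_prob):
    """Ordinary (pooled) Brier score."""
    return np.mean((np.asarray(y_prob, float) - np.asarray(y_true, float)) ** 2)

def stratified_brier_score(y_true, y_prob):
    """Average of the per-class Brier scores, i.e., each class gets equal weight."""
    y_true = np.asarray(y_true, float)
    y_prob = np.asarray(y_prob, float)
    return np.mean([np.mean((y_prob[y_true == c] - c) ** 2) for c in (0.0, 1.0)])

# 5 positive vs. 1,000 negative cases with made-up predicted probabilities
y_true = np.array([1] * 5 + [0] * 1000)
y_prob = np.array([0.60] * 5 + [0.002] * 1000)

# Option 1: pooled Brier score plus calibration-in-the-large and ranking diagnostics
print("prevalence:                ", y_true.mean())   # 0.00498
print("mean predicted probability:", y_prob.mean())   # 0.00498, matches the prevalence
print("pooled Brier score:        ", brier_score(y_true, y_prob))
print("ROC AUC:                   ", roc_auc_score(y_true, y_prob))

# Option 2: stratified Brier score (equal weight per class)
print("stratified Brier score:    ", stratified_brier_score(y_true, y_prob))
```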

Best Answer

If there is extreme class imbalance (e.g., 5 positive cases vs. 1,000 negative cases), how does the Brier score ensure that we select the model that performs best at giving high probability forecasts to the 5 positive cases? After all, we do not care whether the negative cases have predictions near 0 or 0.5, as long as those predictions are lower than the ones for the positive cases.

This depends crucially on whether we can separate subpopulations with different class probabilities based on predictors. As an extreme example, if there are no (or no useful) predictors, then predicted probabilities for all instances will be equal, and requiring lower predictions for negative vs. positive classes makes no sense, whether we are looking at Brier scores or other loss functions.

Yes, this is rather obvious. But we need to keep it in mind.

So let's look at the second simplest case. Assume we have a predictor that separates our population cleanly into two subpopulations. Among subpopulation 1, there are 4 positive and 200 negative cases. Among subpopulation 2, there is 1 positive case and 800 negative cases. (The numbers match your example.) And again, there is zero possibility of further subdividing the subpopulations.

Then we will get constant predicted probabilities of belonging to the positive class: $p_1$ for subpopulation 1 and $p_2$ for subpopulation 2. The Brier score is then

$$ \frac{1}{5+1000}\big(4(1-p_1)^2+200p_1^2+1(1-p_2)^2+800p_2^2\big). $$

Using a little calculus (set the derivative with respect to $p_1$, namely $-8(1-p_1)+400p_1$, to zero, and likewise for $p_2$), we find that this is minimized by

$$ p_1 = \frac{1}{51} \quad\text{and}\quad p_2=\frac{1}{801}, $$

which are precisely the proportions of positive cases in the two subpopulations, $\frac{4}{204}$ and $\frac{1}{801}$. This in turn is as it should be, because this is exactly what the Brier score being proper means.
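As a quick sanity check (my own sketch, not essential to the argument), one can minimize this expression numerically and recover exactly these values:

```python
# Numerically minimize the two-subpopulation Brier score over (p1, p2);
# the minimizer should recover the class proportions 1/51 and 1/801.
from scipy.optimize import minimize

def brier(p):
    p1, p2 = p
    return (4 * (1 - p1) ** 2 + 200 * p1 ** 2
            + 1 * (1 - p2) ** 2 + 800 * p2 ** 2) / 1005

res = minimize(brier, x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)])
print(res.x)             # ~ [0.019608, 0.001248]
print(1 / 51, 1 / 801)   #    0.019608,  0.001248
```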

And there you have it. The Brier score, being proper, will be optimized by the true class membership probabilities. If you have predictors that allow you to identify subpopulations or instances with a higher true probability, then the Brier score will incentivize you to output these higher probabilities. Conversely, if you can't identify such subpopulations, then the Brier score can't help you - but neither can anything else, simply because the information is not there.

However, the Brier score will not reward you for overestimating the probability in subpopulation 1 or underestimating the probability in subpopulation 2 beyond the true values $p_1=\frac{1}{51}$ and $p_2=\frac{1}{801}$, e.g., because "there are more positive cases in subpopulation 1 than in 2". Yes, that is so, but what use would over- or underestimating these values be? We already know about the differential from the difference between $p_1$ and $p_2$, and biasing these estimates will not serve us at all.
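To see this numerically, reusing the `brier()` function from the sketch above: pushing $p_1$ above $\frac{1}{51}$ or $p_2$ below $\frac{1}{801}$ only makes the score worse.

```python
# Biasing the predictions away from the true class proportions worsens the Brier score.
for p1, p2 in [(1 / 51, 1 / 801),   # true class proportions (optimal)
               (0.10, 1 / 801),     # overestimate subpopulation 1
               (1 / 51, 0.0001)]:   # underestimate subpopulation 2
    print(f"p1={p1:.5f}  p2={p2:.6f}  Brier score={brier([p1, p2]):.6f}")
```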

In particular, there is nothing an ROC analysis can help you with beyond finding an "optimal" threshold (which I pontificate on here). And finally, there is nothing in this analysis that depends in any way on classes being balanced or not, so I argue that unbalanced datasets are not a problem.

Finally, this is why I don't see the two answers you propose as useful. The Brier score helps us get at true class membership probabilities. What we then do with these probabilities will depend on our cost structure, and per my post on thresholds above, that is a separate problem. Yes, depending on this cost structure, we may end up with an algebraically reformulated version of a stratified Brier score, but keeping the statistical and the decision theoretic aspect separate keeps the process much cleaner.