If there is extreme class imbalance (e.g. 5 positive cases vs 1,000 negative cases), how does the Brier score ensure that we select the model that gives us the best performance regarding high probability forecasts for the 5 positive cases? After all, we do not care whether the negative cases have predictions near 0 or 0.5, as long as they are lower than those for the positive cases.
This depends crucially on whether we can separate subpopulations with different class probabilities based on predictors. As an extreme example, if there are no (or no useful) predictors, then predicted probabilities for all instances will be equal, and requiring lower predictions for negative vs. positive classes makes no sense, whether we are looking at Brier scores or other loss functions.
Yes, this is rather obvious. But we need to keep it in mind.
So let's look at the second simplest case. Assume we have a predictor that separates our population cleanly into two subpopulations. Among subpopulation 1, there are 4 positive and 200 negative cases. Among subpopulation 2, there is 1 positive case and 800 negative cases. (The numbers match your example.) And again, there is zero possibility of further subdividing the subpopulations.
Then we will get a constant predicted probability of belonging to the positive class: $p_1$ for subpopulation 1 and $p_2$ for subpopulation 2. The Brier score then is
$$ \frac{1}{5+1000}\big(4(1-p_1)^2+200p_1^2+1(1-p_2)^2+800p_2^2\big). $$
Using a little calculus (the score is separable in $p_1$ and $p_2$, so we can set the derivative with respect to each one to zero), we find that this is minimized by
$$ p_1 = \frac{1}{51} \quad\text{and}\quad p_2=\frac{1}{801}, $$
which are precisely the proportions of positive cases in the two subpopulations. This in turn is as it should be, because it is exactly what it means for the Brier score to be proper.
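If you want to verify this without calculus, the minimizers can be recovered numerically. Here is a minimal sketch (pure Python; the counts are those from the example above):

```python
# Brier-score contribution of each subpopulation, as a function of the
# constant predicted probability p (counts taken from the example above).
def sub1(p):
    return 4 * (1 - p) ** 2 + 200 * p ** 2   # 4 positives, 200 negatives

def sub2(p):
    return 1 * (1 - p) ** 2 + 800 * p ** 2   # 1 positive, 800 negatives

# The total score is separable, so each probability can be optimized alone.
grid = [i / 100000 for i in range(100001)]
best_p1 = min(grid, key=sub1)
best_p2 = min(grid, key=sub2)

print(best_p1)  # ≈ 1/51  ≈ 0.0196
print(best_p2)  # ≈ 1/801 ≈ 0.00125
```

The grid search lands (up to grid resolution) exactly on the class proportions derived analytically above.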
And there you have it. The Brier score, being proper, will be optimized by the true class membership probabilities. If you have predictors that allow you to identify subpopulations or instances with a higher true probability, then the Brier score will incentivize you to output these higher probabilities. Conversely, if you can't identify such subpopulations, then the Brier score can't help you - but neither can anything else, simply because the information is not there.
However, the Brier score will not help you in overestimating the probability in subpopulation 1 and in underestimating the probability in subpopulation 2 beyond the true values $p_1=\frac{1}{51}$ and $p_2=\frac{1}{801}$, e.g., because "there are more positive cases in subpopulation 1 than in 2". Yes, that is so, but what use would over-/underestimating this value be? We already know about the differential based on the differences in $p_1$ and $p_2$, and biasing these will not serve us at all.
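To make this last point concrete, here is a small sketch (same counts as above) showing that exaggerating the separation between the two subpopulations strictly worsens the Brier score:

```python
# Brier score for constant predictions p1 (subpopulation 1: 4 positives,
# 200 negatives) and p2 (subpopulation 2: 1 positive, 800 negatives).
def brier(p1, p2):
    return (4 * (1 - p1) ** 2 + 200 * p1 ** 2
            + (1 - p2) ** 2 + 800 * p2 ** 2) / 1005

honest = brier(1 / 51, 1 / 801)   # the true class proportions
biased = brier(2 / 51, 1 / 1602)  # deliberately exaggerated separation
print(honest < biased)  # True: biasing the estimates only hurts
```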
In particular, there is nothing an ROC analysis can help you with beyond finding an "optimal" threshold (which I pontificate on here). And finally, there is nothing in this analysis that depends in any way on classes being balanced or not, so I argue that unbalanced datasets are not a problem.
Finally, this is why I don't see the two answers you propose as useful. The Brier score helps us get at true class membership probabilities. What we then do with these probabilities will depend on our cost structure, and per my post on thresholds above, that is a separate problem. Yes, depending on this cost structure, we may end up with an algebraically reformulated version of a stratified Brier score, but keeping the statistical and the decision theoretic aspect separate keeps the process much cleaner.
Best Answer
ROC curves are insensitive to class imbalance. This means that the ROC curve will look the same when you change the class proportion of your dataset (apart from statistical uncertainty, of course).
That's because the ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity):

- the true positive rate is computed only among the actual positives, and
- the false positive rate is computed only among the actual negatives,

so neither axis depends on the ratio of positives to negatives.

Therefore, changing the class proportion rescales the rows of the confusion matrix, but both ROC coordinates, and hence the curve, stay the same.

You can look at the explanation of the confusion matrix on Wikipedia, which should make all this clearer.
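A tiny sketch (with made-up scores) illustrates the invariance: replicating the negative cases tenfold changes the class ratio drastically, but every point on the ROC curve stays the same.

```python
def tpr_fpr(scores_pos, scores_neg, threshold):
    """One ROC point: TPR uses only positives, FPR uses only negatives."""
    tpr = sum(s >= threshold for s in scores_pos) / len(scores_pos)
    fpr = sum(s >= threshold for s in scores_neg) / len(scores_neg)
    return tpr, fpr

pos = [0.9, 0.8, 0.4, 0.7, 0.6]
neg = [0.3, 0.5, 0.2, 0.1, 0.35]

# Class ratio 1:1 vs. 1:10 -- the ROC point is identical.
print(tpr_fpr(pos, neg, 0.5))       # (0.8, 0.2)
print(tpr_fpr(pos, neg * 10, 0.5))  # (0.8, 0.2)
```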
Now the ROC curve may be unaffected, but this is not the only way to measure a model's performance. Predictive values are affected by class imbalance. For instance, the positive predictive value: given a positive prediction, what is the chance that the observation is actually positive? And the negative predictive value: given a negative prediction, what is the chance that the observation is actually negative? These values are important when you want to apply your model to make decisions on unseen data. So you should make sure to calculate them on a dataset that is representative of the population on which the model will eventually make its predictions.
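As a concrete illustration (with a hypothetical sensitivity and specificity of 0.9 each), the positive predictive value collapses when the prevalence drops from balanced classes to roughly the 5-per-1,000 range of the original question:

```python
# Positive predictive value via Bayes' theorem, for a classifier with the
# given sensitivity and specificity (hypothetical numbers for illustration).
def ppv(sens, spec, prevalence):
    tp = sens * prevalence            # true positives per unit population
    fp = (1 - spec) * (1 - prevalence)  # false positives per unit population
    return tp / (tp + fp)

print(ppv(0.9, 0.9, 0.5))    # 0.9   at balanced classes
print(ppv(0.9, 0.9, 0.005))  # ≈0.043 at 0.5% prevalence
```

Same classifier, same ROC curve, yet at low prevalence a positive prediction is correct less than 5% of the time.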