Optimization Techniques – Reconciling Log-Likelihood and Brier Score

Tags: likelihood, optimization, scoring-rules

Both log-likelihood and the Brier score are proper scoring rules. As such, their expected values are optimised when the predicted probabilities match the true ones. Since there is only one true probability for each predictor vector ($\textbf{x}$), optimising either of the two (maximising log-likelihood, or equivalently minimising log-loss; minimising the Brier score) – or, for that matter, any other proper scoring rule – should lead to the true model, assuming the model is correctly specified, right? And if the model is not correctly specified, no scoring rule can lead to the true model anyway.

So, if my reasoning above is correct, the optima for log-likelihood and the Brier score should coincide, despite the two having completely different forms. For example, log-loss is unbounded – it diverges to infinity as the predicted probability of an observed outcome approaches zero – while the Brier score plateaus. If the optima don't coincide, that would indicate the model was not correct. Is there anything to learn about the mismatch between the model and the data from analysing the differences between the models obtained by optimising different scores?

But isn't optimising the Brier score simply least squares on a function bounded to $[0, 1]$? Shouldn't we avoid least squares for probabilities, because of heteroscedasticity (and the non-normality of the errors)? What am I missing?
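For what it's worth, the claim about coinciding optima under correct specification is easy to check numerically. The sketch below (a simulation of my own, not from any reference) fits a logistic model to data generated from that same model, once by minimising log-loss and once by minimising the Brier score, using `scipy.optimize.minimize`; the two sets of coefficients agree up to sampling noise:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate data from a correctly specified logistic model.
n = 5000
x = rng.normal(size=n)
true_beta = np.array([-0.5, 1.5])  # intercept, slope
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = rng.binomial(1, p_true)

def predict(beta):
    """Predicted probabilities under a logistic model."""
    return 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))

def log_loss(beta):
    # Mean negative log-likelihood; clip to avoid log(0).
    p = np.clip(predict(beta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def brier(beta):
    # Mean squared error between predicted probability and outcome.
    return np.mean((predict(beta) - y) ** 2)

beta_ll = minimize(log_loss, x0=np.zeros(2)).x
beta_br = minimize(brier, x0=np.zeros(2)).x

# With a correctly specified model, both optima estimate the same
# true coefficients, up to sampling noise.
print("log-loss fit:", beta_ll)
print("Brier fit:   ", beta_br)
```

Replacing the data-generating step with a misspecified link (e.g. generating `y` from a probit model) and re-running the two fits is one way to observe the divergence the question asks about.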

Best Answer

Although both log-loss and the Brier score are proper scoring rules, they put their emphasis on different regions of the probability distribution. Quoting from Wikipedia:

the choice of a scoring rule corresponds to an assumption about the probability distribution of decision problems for which the predicted probabilities will ultimately be employed, with for example the quadratic loss (or Brier) scoring rule corresponding to a uniform probability of the decision threshold being anywhere between zero and one.

In contrast, log-loss puts a lot of emphasis on probability extremes. As you recognize, a finite data set will not give you a completely "true" model, so the choice among scoring rules is best determined by the probability cutoff region that you will use in practice, which in turn is determined by the relative costs of different types of misclassification (if you ultimately need to make classification decisions). See this page and its links for more detail.
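That emphasis on extremes follows directly from the definitions of the two per-observation losses; the short sketch below (my own illustration, assuming nothing beyond those definitions) compares the penalties for an observed positive predicted with ever smaller probability:

```python
import numpy as np

def log_loss(p, y):
    # Per-observation log-loss (negative log-likelihood) for a binary outcome y.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def brier(p, y):
    # Per-observation Brier score (squared error on the probability).
    return (p - y) ** 2

# An observed positive (y = 1) predicted with increasing, wrong confidence:
for p in [0.1, 0.01, 0.001]:
    print(f"p={p}: log-loss={log_loss(p, 1):.3f}, Brier={brier(p, 1):.3f}")

# Log-loss grows without bound as p -> 0, while the Brier score
# plateaus just below 1: log-loss punishes confident mistakes far harder.
```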