Scoring Rule vs. Loss Function – Why Use a Different Scoring Rule?

Tags: maximum-likelihood, model-evaluation, regression, scoring-rules

I guess my question is related to these: "Choosing among proper scoring rules" and "The performance metric used in prediction is different from the objective function to train the model", but I'm still puzzled…

I've been taught that the proper way to perform regression is to find the parameters by maximum likelihood. This holds for linear regression as well as for probabilistic models such as logistic regression. Why, then, would I want to quantify the model's performance by anything other than the likelihood (or some monotonic transformation of it)? What would be the justification for using the Brier score, the spherical score, or anything else?
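
To make the comparison concrete, here is a minimal sketch (the synthetic data, scikit-learn, and all variable names are my own assumptions, not part of the question): it fits a logistic regression by maximum likelihood and then scores the same predicted probabilities with two different proper scoring rules, the log score and the Brier score.

```python
# Fit by maximum likelihood, then evaluate the same probabilities with two
# different proper scoring rules.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, brier_score_loss

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)        # fitting = maximising the likelihood
p = model.predict_proba(X)[:, 1]              # predicted P(y = 1)

print("log score  :", log_loss(y, p))         # mean negative log-likelihood
print("Brier score:", brier_score_loss(y, p)) # mean squared error on probabilities
```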

This answer explains:

the choice of scoring rule comes down to how much weight you want to place on different portions of the probability scale or, equivalently, at different relative false-positive and false-negative costs.

I don't understand it. From what I've seen, the Brier score is symmetric with respect to false positives and false negatives. But even if the explanation above were true and I really wanted to weight 'different portions of the probability scale' differently, shouldn't I then minimise that score in the first place, instead of maximising the log-likelihood?
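
(A quick numeric check of that symmetry, with values chosen purely for illustration: a missed event predicted at 0.2 and a non-event predicted at 0.8 receive the same Brier penalty.)

```python
# Both penalties equal 0.64: the Brier score charges over-confidence in either
# direction equally, so false-negative-like and false-positive-like errors of
# the same size cost the same.
fn_like_penalty = (1 - 0.2) ** 2   # event occurred (y = 1), predicted 0.2
fp_like_penalty = (0 - 0.8) ** 2   # event did not occur (y = 0), predicted 0.8
print(fn_like_penalty, fp_like_penalty)
```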

So, the way I see it, I can either

  • perform regression using max-likelihood and then stick to likelihood for evaluation; or
  • decide on the scoring rule based on my real-world application and then find the regression parameters by optimising that score, which I would later also use for evaluating the model.

(Note that in the latter case, if I chose the Brier score, this would boil down to least-squares optimisation. But least squares is maximum likelihood in the presence of Gaussian noise, which cannot hold for binary data…)
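
The second option is easy to try directly. The sketch below is purely illustrative (simulated data, scipy.optimize, and the small eps guard are all my own assumptions): it fits the same logistic model once by maximising the log-likelihood and once by minimising the Brier score, i.e. least squares on the predicted probabilities. Both are proper scoring rules, so the two fits are typically close for a well-specified model, but they are not identical in general.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, expit(X @ true_beta))

def neg_log_lik(beta):
    p = expit(X @ beta)
    eps = 1e-12                      # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def brier(beta):
    p = expit(X @ beta)
    return np.mean((y - p) ** 2)     # least squares on predicted probabilities

beta_ml    = minimize(neg_log_lik, np.zeros(3)).x
beta_brier = minimize(brier,       np.zeros(3)).x
print("ML fit   :", beta_ml)
print("Brier fit:", beta_brier)
```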

What am I missing?

Best Answer

One place where this is done all the time is when the loss function includes a penalty term: we train the model by optimising an objective that contains the penalty, but we then evaluate it without the penalty term. Ridge regression, for instance, works this way.

For instance, the loss function for ridge regression is $L(y,\hat y \mid \lambda) = \sum_i\left(y_i - \hat y_i\right)^2 + \lambda\,\lVert\hat\beta\rVert_2^2$, but the performance is typically evaluated on just the squared loss, $\sum_i\left(y_i - \hat y_i\right)^2$.
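
As a concrete illustration (the synthetic data and scikit-learn are my own assumptions, not part of the answer), the model below is trained with the penalised ridge objective but evaluated with plain mean squared error on held-out data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training minimises ||y - Xb||^2 + alpha * ||b||^2 ...
model = Ridge(alpha=1.0).fit(X_train, y_train)

# ... but evaluation reports plain squared error, without the penalty term.
y_hat = model.predict(X_test)
print("test MSE:", mean_squared_error(y_test, y_hat))
```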