Machine Learning – Why Use Squared Loss on Probabilities Instead of Logistic Loss?

classification, hidden-markov-model, logistic, loss-functions, machine-learning

I am reading about the "Bayesian Knowledge Tracing" model fitting process. Model details can be found here. In short, it is a modified Hidden Markov Model applied in an educational setting.

I have some questions about the code posted by the author here (from Columbia University), where it seems the author uses squared loss on probabilities to check how good the fit is. In the attached documents the author says:

LikelihoodCorrect is calculated for each student action. After that, LikelihoodCorrect is subtracted from the studentAction and squared to get the squared residual (SR), and then the SRs are summed to get the SSR.

$$\text{likelihoodcorrect} = \text{prevL} \cdot (1.0-\text{Slip}) + (1.0-\text{prevL}) \cdot \text{Guess}$$
$$\text{SSR} \mathrel{+}= (\text{StudentAction}-\text{likelihoodcorrect})^2$$

(In the data file the author provided, student action is a binary variable. So, this is 0 or 1 minus predicted probability, then squared.)
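
To make the calculation concrete, here is a minimal sketch of the SSR computation described above (not the author's actual code); `prev_l_values`, `slip`, `guess`, and `student_actions` are illustrative names, and the BKT update of prevL between actions is omitted:

```python
def bkt_ssr(prev_l_values, slip, guess, student_actions):
    """Sum of squared residuals between binary outcomes and BKT predictions."""
    ssr = 0.0
    for prev_l, action in zip(prev_l_values, student_actions):
        # Predicted probability of a correct response given the current P(L)
        likelihood_correct = prev_l * (1.0 - slip) + (1.0 - prev_l) * guess
        # Squared residual against the 0/1 student action, accumulated into SSR
        ssr += (action - likelihood_correct) ** 2
    return ssr
```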

Should we use the logistic loss (the negative log-likelihood),

$$
-\big(y\log(p)+(1-y)\log(1-p)\big),
$$

instead of the squared loss

$$(y-p)^2\,?$$
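
For reference, a hedged sketch of how the two losses compare numerically for a single binary outcome (the function and variable names here are only illustrative):

```python
import math

def squared_loss(y, p):
    # Brier-style squared error between the 0/1 outcome and the probability
    return (y - p) ** 2

def logistic_loss(y, p, eps=1e-12):
    # Negative log-likelihood; eps guards against log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(squared_loss(1, 0.9), logistic_loss(1, 0.9))    # good prediction: both losses small
print(squared_loss(1, 0.01), logistic_loss(1, 0.01))  # confident mistake: logistic loss far larger
```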

Why do so many publications use squared loss on a binary variable instead of logistic loss? For example, this paper by Carnegie Mellon University (page 7, end of Section 3) states:

All of the models were cross-validated using 10 randomly assigned user-stratified folds. For each of the cross-validation results we computed root mean squared error (RMSE) and accuracy (number of correctly predicted student successes and failures).
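
For concreteness, here is a minimal sketch (not the paper's code) of how RMSE and accuracy might be computed on one fold's held-out predictions; `y_true` and `p_pred` are illustrative names:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])             # observed successes/failures
p_pred = np.array([0.8, 0.3, 0.6, 0.9, 0.4])   # model-predicted probabilities of success

rmse = np.sqrt(np.mean((y_true - p_pred) ** 2))   # root mean squared error
accuracy = np.mean((p_pred >= 0.5) == y_true)     # fraction correct at a 0.5 threshold
print(rmse, accuracy)
```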

Best Answer

Squared loss on binary outcomes is called the Brier score. It's valid in the sense of being a "proper scoring rule": you get the lowest expected squared error when you report the correct probability. In other words, logistic loss and squared loss are both minimized, in expectation, by the true probability.
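
To see why, suppose the outcome $y$ is Bernoulli with true probability $p$ and you report a probability $q$ (these symbols are just for this check). The expected squared loss is

$$\mathbb{E}\big[(y-q)^2\big] = p(1-q)^2 + (1-p)q^2,$$

whose derivative in $q$ is $2(q-p)$, so it is minimized at $q=p$. The same calculation for the expected logistic loss, $-p\log q - (1-p)\log(1-q)$, also gives $q=p$, which is the sense in which the two losses share the same minimizer.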

This paper compares the properties of the Brier score ("square loss") with some other loss functions. The authors find that the square loss/Brier score converges more slowly than the logistic loss.

Square loss has some advantages that might compensate in some cases:

  • It's always finite (unlike logistic loss, which can be infinite if $p=1$ and $y=0$ or vice versa; see the short example after this list)
  • Its penalty grows faster than linearly as the size of the error increases (so, compared to accuracy and absolute loss, it's less likely to let wildly inaccurate predictions slip through)
  • It's differentiable everywhere (unlike hinge loss and zero-one loss)
  • It's the most commonly implemented loss in software packages, so it might be the only option in some cases
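
To illustrate the finiteness point, a small hedged example (values chosen only for illustration): the squared loss stays in $[0,1]$, while the logistic loss grows without bound as a confidently wrong prediction approaches certainty.

```python
import math

y, p = 0, 0.999999  # confident but wrong prediction
print((y - p) ** 2)                                    # about 1.0, still finite
print(-(y * math.log(p) + (1 - y) * math.log(1 - p)))  # about 13.8, and unbounded as p -> 1
```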