Logistic Regression – Using MSE Instead of Log-Loss in Logistic Regression

logistic, maximum-likelihood, mse, unbiased-estimator

Suppose we replace the loss function of logistic regression (normally the log-likelihood) with the MSE. That is, we still let the log odds be a linear function of the parameters, but we minimize the sum of squared differences between the estimated probability and the outcome (coded as 0/1):

$\log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$

and minimize $\sum_i (y_i - p_i)^2$ instead of maximizing $\sum_i \left[ y_i \log p_i + (1-y_i) \log (1-p_i) \right]$.
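To make the comparison concrete, here is a rough sketch of fitting the same logit-linear model under both criteria with scipy.optimize; the data, sample size, and coefficient values are made up purely for illustration.

```python
# Sketch only: fit the same logit-linear model two ways, by maximizing the
# log-likelihood (i.e. minimizing log-loss) and by minimizing squared error on
# the probability scale. Data and variable names are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function: expit(z) = 1 / (1 + exp(-z))

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, expit(X @ beta_true))                # 0/1 outcomes

def neg_log_lik(beta):
    p = expit(X @ beta)
    eps = 1e-12                                          # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def sum_sq_err(beta):
    p = expit(X @ beta)
    return np.sum((y - p) ** 2)                          # squared error on probabilities

beta_ml  = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS").x
beta_mse = minimize(sum_sq_err, x0=np.zeros(2), method="BFGS").x
print("log-loss fit:", beta_ml)
print("MSE fit:     ", beta_mse)
```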

Of course, I understand why log-likelihood makes sense under some assumptions. But in machine learning, where assumptions are usually not made, what is the intuitive reason that MSE is completely unreasonable? (Or are there situations where MSE might make sense?)

Best Answer

The short answer is that likelihood theory exists to guide us towards optimum solutions, and optimizing something other than the likelihood, penalized likelihood, or Bayesian posterior density results in suboptimal estimators. Secondly, minimizing the sum of squared errors leads to unbiased estimates of the true probabilities. Here you do not actually want unbiased estimates, because achieving unbiasedness requires allowing estimates that are negative or greater than one. Properly constraining the estimates to $[0,1]$ generally requires accepting slightly biased estimates (shrunk towards the middle) on the probability (not the logit) scale.
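One way to see the efficiency point is a small simulation: fit the same logit-linear model repeatedly by maximum likelihood and by least squares on the probability scale, and compare the sampling spread of the slope estimates. The sketch below assumes a correctly specified logistic model and uses illustrative sample sizes and coefficients of my own choosing.

```python
# Simulation sketch under assumed settings (correctly specified logistic model,
# illustrative sample size and coefficients): compare the sampling spread of the
# slope estimate from the maximum-likelihood fit and from the squared-error fit.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(2)
beta_true = np.array([-0.5, 1.0])

def fit(X, y, use_log_loss):
    def objective(beta):
        p = expit(X @ beta)
        eps = 1e-12
        if use_log_loss:
            return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        return np.sum((y - p) ** 2)
    return minimize(objective, x0=np.zeros(X.shape[1]), method="BFGS").x

slopes_ml, slopes_sse = [], []
for _ in range(500):                                     # 500 simulated datasets
    X = np.column_stack([np.ones(200), rng.normal(size=200)])
    y = rng.binomial(1, expit(X @ beta_true))
    slopes_ml.append(fit(X, y, use_log_loss=True)[1])
    slopes_sse.append(fit(X, y, use_log_loss=False)[1])

# Under the stated assumptions the likelihood-based fit is asymptotically
# efficient, so its spread should typically be no larger than the MSE fit's.
print("SD of slope, log-loss fit:", np.std(slopes_ml))
print("SD of slope, MSE fit:     ", np.std(slopes_sse))
```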

Don't believe that machine learning methods do not make assumptions. This issue has little to do with machine learning.

Note that an individual proportion is an unbiased estimate of the true probability, hence a binary logistic model containing only an intercept provides an unbiased estimate. A binary logistic model with a single predictor that has $k$ mutually exclusive categories will likewise provide $k$ unbiased estimates of the probabilities. I think that a model that capitalizes on additivity assumptions and allows the user to request estimates outside the range of the data (e.g., one with a single continuous predictor) will have a small bias on the probability scale so as to respect the $[0,1]$ constraint.
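A quick numerical check of the intercept-only case is sketched below: the maximum-likelihood fit of a logistic model with no predictors reproduces the raw sample proportion, which is itself an unbiased estimate of the true probability. The simulated data and settings are illustrative assumptions.

```python
# Sketch of a quick check on simulated data: the maximum-likelihood fit of an
# intercept-only binary logistic model reproduces the raw sample proportion.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=1000)                      # Bernoulli(0.3) outcomes

def neg_log_lik(beta):
    p = expit(beta[0])                                   # intercept only: constant p
    eps = 1e-12
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

b0_hat = minimize(neg_log_lik, x0=np.zeros(1), method="BFGS").x[0]
print("sample proportion :", y.mean())
print("fitted probability:", expit(b0_hat))              # agrees up to optimizer tolerance
```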