Maximum Likelihood – Numerical Difference Between Sum of Squared Residuals and Likelihood

least squares, likelihood, maximum likelihood

I previously asked a question that was flagged as a duplicate because I did not explain it correctly. I should not have used the regression model as an example, since I can see how that example could make it look like a duplicate question. I am rephrasing the question here.

Consider any likelihood that can be written as a function of the squared residuals. Numerically speaking, from an optimization standpoint, what is the difference between maximizing the likelihood and minimizing the sum of squares? My belief is that both methodologies should lead to the same estimates, numerically. I agree that the resulting estimates may not be MLEs, but why do some people program the likelihood as the objective function for numerical routines when the sum of squared residuals should be equivalent and, I think, simpler? Thanks.

Actually, I think the simplest way to ask the question is the following: when wanting to obtain numerical estimates based on a likelihood, is it ever wrong to just minimize the sum of squared residuals? Thanks again.

Best Answer

"When wanting to obtain numerical estimates based on a likelihood, is it ever wrong to just minimize the sum of the residuals squared." --- Almost always. If the parameter appears in the likelihood function in a very particular way, then ML corresponds to least squares.

In particular, consider the simple case of a single location parameter, $\mu$.

For us to get least squares, we need to use as our estimate, $\hat{\mu}^\text{LS}$, the value of $\mu$ that minimizes $\sum_i (y_i-\mu)^2$.

Turning to maximum likelihood: if the data have density $f$ and the observations are independent, we want to find $\hat{\mu}^\text{ML}$, the value of $\mu$ that maximizes $\prod_i f(y_i;\mu)$. Let $g=\log f$; then that's the same as maximizing $\sum_i g(y_i;\mu)$.

That's the same as minimizing $c-k \sum_i g(y_i;\mu)$ for any convenient positive $k$ and any convenient real $c$.

So as long as $c-k \log f(y;\mu)=(y-\mu)^2$ for some convenient $c$ and $k$, ML will be least squares.

Consequently $f(y;\mu)=e^{-\frac{1}{k}(y-\mu)^2+\frac{c}{k}}$ for some $k$ and $c$.

The normal density with mean $\mu$ and some given variance $\sigma^2$ is of this form (for a suitable choice of constants: $k$ is a function of $\sigma^2$, and $c/k$ is a function of $\sigma$ that normalizes it to a density).
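For concreteness, here is that substitution worked out as a quick check, assuming the variance $\sigma^2$ is known:

$$
f(y;\mu)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(y-\mu)^2}{2\sigma^2}}
=e^{-\frac{1}{k}(y-\mu)^2+\frac{c}{k}},
\qquad k=2\sigma^2,\quad \frac{c}{k}=-\log\!\left(\sigma\sqrt{2\pi}\right).
$$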

So we see, in that simple case at least, that the least squares estimate can be obtained by finding the ML estimate of a normal location parameter. Many more complicated situations (including regression) work in essentially identical fashion: to get least squares to be ML, start with estimating location parameters for Gaussian-distributed variables.

So if you pick something else for $f$, the MLE for the location parameter doesn't come out to be least squares.

As for the numerical difference: if you compare the sum of squares as a function of $\mu$ with $-2\log\mathcal{L}(\mu)$ in the univariate normal case, their argmins coincide, but the values of the two functions at the argmin may differ numerically because of the $k$ and $c$ above, which depend on the variance and the sample size.
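Here is a minimal numerical sketch of that point (not from the original post; the simulated data, the fixed $\sigma^2 = 4$, and the use of `scipy.optimize.minimize_scalar` are my own illustrative choices): the two objectives are minimized at the same $\hat{\mu}$, but take different values there.

```python
# Illustrative sketch: same argmin, different objective values at the minimum.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
y = rng.normal(loc=5.0, scale=2.0, size=100)
sigma2 = 4.0  # treat the variance as known for simplicity

def sse(mu):
    # sum of squared residuals
    return np.sum((y - mu) ** 2)

def neg2loglik(mu):
    # -2 * log-likelihood of N(mu, sigma2), constants included
    return np.sum((y - mu) ** 2) / sigma2 + len(y) * np.log(2 * np.pi * sigma2)

fit_sse = minimize_scalar(sse)
fit_ml = minimize_scalar(neg2loglik)

print(fit_sse.x, fit_ml.x)      # same argmin (the sample mean), up to solver tolerance
print(fit_sse.fun, fit_ml.fun)  # different values at that argmin (the k and c above)
```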

"Consider any likelihood that can be written as a function of the squared residuals. Then, numerically speaking from an optimization standpoint, what is the difference between maximizing the likelihood and minimizing the sum of squares?"

If by 'a function of the squared residuals' you mean some $\ell\big((y_i-\mu)^2\big)$ other than a straight $\sum_i (y_i-\mu)^2$, then all sorts of possibilities exist.

In comments, whuber mentions $\sum_i \sqrt{(y_i-\mu)^2} = \sum_i |y_i-\mu|$, which is a function of the squared residuals but is not least squares; of course there are infinitely many other such functions that are not least squares, some of which may correspond to ML estimators.
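A quick illustration of that particular case (a sketch with made-up data; the specific numbers and the use of `minimize_scalar` are my own): minimizing $\sum_i |y_i-\mu|$ yields the sample median rather than the sample mean, so it is not least squares, although it is ML for a Laplace (double-exponential) location parameter.

```python
# Illustrative sketch: minimizing sum |y_i - mu| gives the median, not the mean.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large value pulls the mean, not the median

lad = minimize_scalar(lambda mu: np.sum(np.abs(y - mu)))
print(lad.x, np.median(y))  # both about 3.0
print(np.mean(y))           # 22.0, the least squares estimate
```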

Consider the location-scale family of $t_\nu$-distributions, for example. For simplicity, take the scale and $\nu$ to be fixed.

These also have likelihoods which are functions of the squared residuals, but least squares is not ML for them.
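As a sketch of that (again with illustrative data of my own; the choice of $\nu=3$, the fixed scale, and `scipy.stats.t` are assumptions for the example), the ML location estimate under a $t_3$ model downweights an outlier, while the least squares estimate, the sample mean, is pulled toward it:

```python
# Illustrative sketch: t_3 location MLE vs. the least squares estimate (the mean).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import t

y = np.array([-1.2, 0.3, 0.8, 1.1, 15.0])  # bulk of data near zero plus one outlier
nu, scale = 3.0, 1.0                       # nu and scale held fixed, as in the answer

def negloglik(mu):
    # negative log-likelihood of the t_nu location model
    return -np.sum(t.logpdf(y, df=nu, loc=mu, scale=scale))

fit = minimize_scalar(negloglik)
print(fit.x)       # ML location estimate, close to the bulk of the data
print(np.mean(y))  # least squares estimate (3.2), pulled toward the outlier
```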
