"*When wanting to obtain numerical estimates based on a likelihood, is it ever wrong to just minimize the sum of the residuals squared?*" --- Almost always. Only if the parameter appears in the likelihood function in a very particular way does ML correspond to least squares.

In particular, consider the simple case of a single location parameter, $\mu$.

For us to get least squares, we need to use as our estimate, $\hat{\mu}^\text{LS}$, the value of $\mu$ that minimizes $\sum_i (y_i-\mu)^2$.

So when it comes to maximizing likelihood where the data has density $f$, note that with independent observations, we want to find $\hat{\mu}^\text{ML}$, the value of $\mu$ that maximizes $\prod_i f(y_i;\mu)$. Let $g=\log f$. Then that's the same as maximizing $\sum_i g(y_i;\mu)$.

That's the same as minimizing $c-k \sum_i g(y_i;\mu)$ for any convenient positive $k$ and any convenient real $c$.

So as long as $c-k \log f(y;\mu)=(y-\mu)^2$ for some convenient $c$ and $k$, ML will be least squares.

Consequently $f(y;\mu)=e^{-\frac{1}{k}(y-\mu)^2+\frac{c}{k}}$ for some $k$ and $c$.

The normal density with mean $\mu$ and some given variance $\sigma^2$ is of this form (for a suitable choice of constants: $k$ is a function of $\sigma^2$, and $c/k$ is also a function of $\sigma$ that normalizes $f$ to a density).
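Comparing the exponent of the normal density term by term with the form above pins the constants down explicitly:

$$f(y;\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} = e^{-\frac{1}{k}(y-\mu)^2+\frac{c}{k}} \quad\Longrightarrow\quad k = 2\sigma^2, \qquad \frac{c}{k} = -\tfrac{1}{2}\log(2\pi\sigma^2).$$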

So we see, in that simple case at least, that the least squares estimate can be obtained by finding the ML estimate of a normal location parameter. Many more complicated situations (including regression) work in essentially identical fashion -- to get least squares to be ML, start with estimating location parameters for Gaussian distributed variables.

So if you pick an $f$ outside this family, the MLE for the location parameter doesn't come out to be least squares.

As for the numerical difference: if you're comparing the sum of squares as a function of $\mu$ and $-2\log\mathcal{L}(\mu)$ for the univariate normal ... while their argmins coincide, the values of the functions at the argmin may differ numerically because of the $k$ and $c$ above, which depend on the variance and the sample size.
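A quick numerical check of this point (a minimal sketch in Python; the toy data, the grid, and $\sigma^2=1$ are my own choices for illustration): the two objective functions have the same argmin, but their values there differ by the constant $n\log(2\pi)$ when $\sigma^2 = 1$.

```python
import math

y = [1.2, 0.7, 2.1, 1.5, 0.9]               # toy data
n = len(y)
grid = [i / 1000 for i in range(0, 3001)]    # candidate values of mu

def ss(mu):                                  # sum of squared residuals
    return sum((yi - mu) ** 2 for yi in y)

def neg2loglik(mu, sigma2=1.0):              # -2 log L for N(mu, sigma2)
    return n * math.log(2 * math.pi * sigma2) + ss(mu) / sigma2

mu_ls = min(grid, key=ss)
mu_ml = min(grid, key=neg2loglik)

print(mu_ls == mu_ml)                        # True: same argmin (the sample mean)
print(neg2loglik(mu_ls) - ss(mu_ls))         # constant offset n*log(2*pi)
```

With $\sigma^2=1$ the offset is a pure additive constant, so the minimizer is untouched; a different $\sigma^2$ would also rescale the sum of squares, which again leaves the argmin alone.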

"*Consider any likelihood that can be written as a function of the squared residuals. Then, numerically speaking, from an optimization standpoint, what is the difference between maximizing the likelihood and minimizing the sum of squares?*"

If by 'a function of the residuals squared' you mean some $h\big((y_i-\mu)^2\big)$ other than the straight sum $\sum_i (y_i-\mu)^2$, then all sorts of possibilities exist.

In comments, whuber mentions $\sum_i \sqrt{(y_i-\mu)^2} = \sum_i |y_i-\mu|$, which is a function of the squared residuals but is not least squares; of course there are infinitely many other such functions that are not least squares, some of which correspond to ML estimators.
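That particular function, $\sum_i |y_i-\mu|$, is minimized by the sample median, and it is the ML objective for a Laplace (double-exponential) location parameter. A small grid-search sketch in Python (toy data of my own choosing) shows it disagrees with least squares:

```python
y = [0.0, 1.0, 2.0, 3.0, 10.0]            # toy data with one large value
grid = [i / 100 for i in range(-200, 1201)]

def sad(mu):                               # sum of absolute deviations,
    return sum(abs(yi - mu) for yi in y)   # i.e. sum of sqrt of squared residuals

mu_l1 = min(grid, key=sad)                 # ML for a Laplace location parameter
mu_ls = sum(y) / len(y)                    # least squares: the mean

print(mu_l1)                               # 2.0, the median
print(mu_ls)                               # 3.2, the mean
```

Both objectives depend on the data only through $(y_i-\mu)^2$, yet their minimizers differ whenever the mean and median do.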

Consider the location-scale family of $t_\nu$-distributions, for example. For simplicity, take the scale and $\nu$ to be fixed.

These also have likelihoods which are functions of the squared residuals, but least squares is not ML for them.
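A sketch of that claim for $\nu=1$ (the Cauchy case) with scale fixed at 1; the data and grid are invented for illustration. The negative log-likelihood is a sum of terms that each depend only on the squared residual, yet its minimizer is not the least-squares estimate:

```python
import math

y = [0.0, 0.1, -0.1, 0.2, 10.0]           # bulk near zero plus one outlier
grid = [i / 1000 for i in range(-1000, 3001)]

def neg_loglik_t1(mu):
    # Cauchy (t with nu=1, scale 1) negative log-likelihood, up to a constant;
    # each term is a function of the squared residual (y_i - mu)^2
    return sum(math.log(1.0 + (yi - mu) ** 2) for yi in y)

mu_ml = min(grid, key=neg_loglik_t1)      # ML location estimate
mu_ls = sum(y) / len(y)                   # least squares: the mean

print(mu_ml)                              # stays near the bulk of the data
print(mu_ls)                              # dragged toward the outlier: 2.04
```

The heavy tails discount the outlier, so the ML estimate sits near the bulk of the data while the mean is pulled far from it.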

Numerical stability is by far the most important reason for using the log-likelihood instead of the likelihood; that alone is more than enough to choose it. Another reason that comes to mind is that when an analytical solution exists, it is often much easier to find with the log-likelihood.

The likelihood function is typically a product of likelihood contributions by each observation. Taking the derivative of that will quickly lead to an unmanageable number of cross-product terms due to the product rule. In principle it is possible, but I don't want to be the person to keep track of all those terms.

The log-likelihood transforms that product of individual contributions to a sum of contributions, which is much more manageable due to the sum rule.
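The stability point is easy to demonstrate: a product of many small density values underflows in double precision, while the sum of their logs stays comfortably representable. A minimal sketch (the contribution values are made up):

```python
import math

contributions = [1e-3] * 200              # 200 tiny likelihood contributions

likelihood = 1.0
for p in contributions:
    likelihood *= p                       # 1e-600 is below the smallest double

log_likelihood = sum(math.log(p) for p in contributions)

print(likelihood)                         # 0.0 -- the product has underflowed
print(log_likelihood)                     # about -1381.55, perfectly usable
```

Once the product hits exactly 0.0, every candidate parameter value looks equally (im)plausible to an optimizer; the log-likelihood keeps the comparisons meaningful.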

## Best Answer

This is an alternative answer: optimizers in statistical packages usually work by *minimizing* the result of a function. Taking the logarithm is also more convenient because it shrinks the value returned by the likelihood function to a manageable range. Then, since the log-likelihood and the likelihood are increasing and decreasing together, you can *minimize* the negative log-likelihood in order to actually perform *maximum* likelihood estimation of the function you are testing. See for example the `nlminb` function in R.
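An analogous sketch in Python (a simple golden-section search stands in for a general-purpose minimizer such as R's `nlminb`; the data are invented): hand the optimizer the *negative* log-likelihood, and its minimizer is the ML estimate.

```python
import math

y = [1.2, 0.7, 2.1, 1.5, 0.9]              # toy data

def neg_loglik(mu, sigma2=1.0):            # negative log-likelihood for N(mu, sigma2)
    n = len(y)
    ss = sum((yi - mu) ** 2 for yi in y)
    return 0.5 * n * math.log(2 * math.pi * sigma2) + 0.5 * ss / sigma2

def golden_min(f, a, b, tol=1e-8):         # toy stand-in for a package minimizer
    phi = (math.sqrt(5) - 1) / 2
    while b - a > tol:
        c, d = b - phi * (b - a), a + phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

mu_hat = golden_min(neg_loglik, -10.0, 10.0)
print(mu_hat)                              # close to the sample mean, 1.28
```

As the answer above notes, for the normal location parameter this recovers the sample mean, i.e. the least-squares estimate.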