Solved – Maximum likelihood method vs. least squares method

estimation, least squares, maximum likelihood, regression

What is the main difference between maximum likelihood estimation (MLE) and least squares estimation (LSE)?

Why can't we use MLE for predicting $y$ values in linear regression and vice versa?

Any help on this topic will be greatly appreciated.

Best Answer

I'd like to provide a straightforward answer.

What is the main difference between maximum likelihood estimation (MLE) and least squares estimation (LSE)?

As @TrynnaDoStat commented, minimizing squared error is equivalent to maximizing the likelihood in this case. As stated on Wikipedia,

In a linear model, if the errors belong to a normal distribution the least squares estimators are also the maximum likelihood estimators.

They can be viewed as essentially the same in your case, since the conditions of the least squares method are these four: 1) linearity; 2) normal residuals; 3) constant variance (homoscedasticity); 4) independence.

Let me detail it a bit. Since the response variable $y$, $$y = w^T X + \epsilon \quad\text{where } \epsilon \sim N(0,\sigma^2),$$ follows a normal distribution (normal residuals),
$$P(y|w, X)=\mathcal{N}(y|w^TX, \sigma^2I)$$
then the likelihood function (by independence) is

\begin{align} L\left(y^{(1)},\dots,y^{(N)};w, X^{(1)},\dots,X^{(N)}\right) &= \prod_{i=1}^N \mathcal{N}\left(y^{(i)}\mid w^TX^{(i)}, \sigma^2I\right) \\ &= \frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N\left(y^{(i)}-w^TX^{(i)}\right)^2\right). \end{align}
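Taking the logarithm (a monotone transformation, so it does not change the maximizer) makes the next step explicit:

$$\log L = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N\left(y^{(i)}-w^TX^{(i)}\right)^2.$$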

Maximizing $L$ is equivalent to minimizing (since the other factors are constants, by homoscedasticity) $$\sum_{i=1}^N\left(y^{(i)}-w^TX^{(i)}\right)^2.$$ That is exactly the least-squares criterion: the sum of squared differences between the fitted values $\hat{y}^{(i)} = w^T X^{(i)}$ and the observed values $y^{(i)}$.
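To make the equivalence concrete, here is a minimal sketch (assuming NumPy and SciPy are available; the data, true weights, and noise level are made up for illustration) that fits the same simulated dataset both by least squares and by numerically maximizing the Gaussian log-likelihood. The two weight estimates agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data under the assumed model y = w^T x + eps, eps ~ N(0, sigma^2).
# Sample size, true weights, and noise level are illustrative choices.
rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + one feature
w_true = np.array([1.5, -2.0])
y = X @ w_true + rng.normal(scale=0.5, size=N)

# Least squares: closed-form solution of the normal equations.
w_lse, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximum likelihood: minimize the negative Gaussian log-likelihood over w.
# With sigma held fixed, the maximizing w is the same regardless of its value.
def neg_log_likelihood(w, sigma=0.5):
    resid = y - X @ w
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print(w_lse, w_mle)  # the two estimates coincide up to numerical tolerance
```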

Why can't we use MLE for predicting $y$ values in linear regression and vice versa?

As explained above, we are actually (more precisely, equivalently) using MLE when we predict $y$ values in linear regression. If the response variable instead follows some other distribution, such as the Bernoulli distribution or any member of the exponential family, we map the linear predictor to the response distribution using a link function (chosen according to the response distribution); the likelihood then becomes the product of the outcome probabilities (values between 0 and 1) after that transformation, as sketched below. In linear regression the link function can be treated as the identity function, since the linear predictor already models the mean of the response directly.
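For instance, with a Bernoulli response and a logit link the same recipe gives logistic regression. Here is a minimal sketch (again assuming NumPy and SciPy; the simulated data and true weights are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Bernoulli response with a logit link, fitted by maximum likelihood.
rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
w_true = np.array([-0.5, 2.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = rng.binomial(1, sigmoid(X @ w_true))  # 0/1 outcomes

# Negative log-likelihood: the logit link maps the linear predictor X @ w
# into (0, 1), and the likelihood is the product of Bernoulli probabilities.
def neg_log_likelihood(w):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(w_mle)  # close to w_true; this is exactly logistic regression
```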
