Solved – Relationship between MLE and least squares in case of linear regression

Tags: least squares, maximum likelihood, regression

Hastie and Tibshirani mention in section 4.3.2 of their book that in the
linear regression setting, the least squares approach is in fact a special case
of maximum likelihood. How can we prove this result?

PS: Spare no mathematical details.

Best Answer

The linear regression model is

$Y = X\beta + \epsilon$, where $\epsilon \sim N(0,I\sigma^2)$

$Y \in \mathbb{R}^{n}$, $X \in \mathbb{R}^{n \times p}$ and $\beta \in \mathbb{R}^{p}$

Note that our model error is ${\bf \epsilon = Y - X\beta}$. Our goal is to find the vector of $\beta$s that minimizes the squared $L_2$ norm of this error.

Least Squares

Given data $(x_1,y_1),\dots,(x_n,y_n)$, where each $x_{i}$ is a $p$-dimensional row vector, we seek to find:

$$\widehat{\beta}_{LS} = {\underset \beta {\text{argmin}}} ||{\bf \epsilon}||^2 = {\underset \beta {\text{argmin}}} ||{\bf Y - X\beta}||^2 = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} ( y_i - x_{i}\beta)^2 $$
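As a quick numerical illustration (a minimal Python/`numpy` sketch; the sample size, dimensions, true coefficients, and seed below are arbitrary choices for illustration, not anything from the book), the least-squares estimate can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and true coefficients (arbitrary choices).
n, p = 100, 3
beta_true = np.array([1.5, -2.0, 0.5])
sigma = 1.0

# Simulate data from the model Y = X beta + eps, with eps ~ N(0, sigma^2 I).
X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Least-squares estimate: argmin over beta of ||Y - X beta||^2.
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_ls)
```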

Maximum Likelihood

Using the model above, we can set up the likelihood of the data given the parameters $\beta$ as:

$$L(Y|X,\beta) = \prod_{i=1}^{n} f(y_i|x_i,\beta) $$

where $f(y_i|x_i,\beta)$ is the pdf of a normal distribution with mean $x_i\beta$ and variance $\sigma^2$. Plugging it in:

$$L(Y|X,\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - x_i\beta)^2}{2\sigma^2}}$$

Now, when dealing with likelihoods it's generally easier mathematically to take the log before continuing (products become sums, exponentials go away), so let's do that.

$$\log L(Y|X,\beta) = \sum_{i=1}^{n} \left[ \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{(y_i - x_i\beta)^2}{2\sigma^2} \right]$$
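As a small sketch, this log-likelihood translates directly into code; the function name `log_likelihood` and its arguments are illustrative choices, not part of the original derivation:

```python
import numpy as np

def log_likelihood(beta, X, Y, sigma):
    """Gaussian log-likelihood of the linear model Y = X beta + eps."""
    n = len(Y)
    resid = Y - X @ beta
    # n copies of the constant term, minus the summed residual term above.
    return n * np.log(1.0 / np.sqrt(2 * np.pi * sigma**2)) - np.sum(resid**2) / (2 * sigma**2)
```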

Since we want the maximum likelihood estimate, we want to find the maximum of the expression above with respect to $\beta$. The first term does not depend on $\beta$, so it doesn't impact our estimate and can be dropped:

$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmax}}} \sum_{i=1}^{n} -\frac{(y_i - x_i\beta)^2}{2\sigma^2}$$

Note that the denominator $2\sigma^2$ is a positive constant with respect to $\beta$, so it does not affect the argmax. Finally, notice the negative sign in front of the sum: maximizing the negative of a quantity is the same as minimizing that quantity. In other words:

$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} (y_i - x_i\beta)^2 = \widehat{\beta}_{LS}$$
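This equivalence can be checked numerically. The sketch below is illustrative only: the simulated data, seed, and the use of `scipy.optimize.minimize` with its default settings are assumptions made for the example. Minimizing the negative log-likelihood and solving least squares should recover essentially the same $\widehat{\beta}$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, sigma = 100, 3, 1.0
beta_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Negative log-likelihood with the constant term dropped (it does not depend on beta).
def neg_log_lik(beta):
    return np.sum((Y - X @ beta) ** 2) / (2 * sigma**2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(p)).x   # numerical MLE
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)      # least-squares solution

print(beta_mle)
print(beta_ls)
print(np.allclose(beta_mle, beta_ls, atol=1e-4))     # agree up to optimizer tolerance
```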

Recall that for this to work, we had to make certain model assumptions (independent, normally distributed error terms with mean 0 and constant variance). Under these conditions, least squares is equivalent to maximum likelihood.

For completeness, note that (assuming ${\bf X^TX}$ is invertible) the solution can be written in closed form as:

$$\widehat{\beta} = ({\bf X^TX})^{-1}{\bf X^TY} $$
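As a final sketch (again with illustrative simulated data), this closed form is typically computed by solving the normal equations rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)

# Normal equations: (X^T X) beta = X^T Y, solved without an explicit matrix inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)
```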
