Hastie and Tibshirani mention in section 4.3.2 of their book that in the
linear regression setting, the least squares approach is in fact a special case
of maximum likelihood. How can we prove this result?
PS: Spare no mathematical details.
I'd like to provide a straightforward answer.
What is the main difference between maximum likelihood estimation (MLE) and least squares estimation (LSE)?
As @TrynnaDoStat commented, minimizing squared error is equivalent to maximizing the likelihood in this case. As said in Wikipedia,
In a linear model, if the errors belong to a normal distribution the least squares estimators are also the maximum likelihood estimators.
They can be viewed as essentially the same thing in your case, since the conditions of the least squares method are these four: 1) linearity; 2) nearly normal residuals; 3) constant variability/homoscedasticity; 4) independence.
Let me detail it a bit. Since the response variable $y$,
$$y=w^T X +\epsilon \quad\text{ where }\epsilon\thicksim N(0,\sigma^2),$$
follows a normal distribution (normal residuals),
$$P(y|w, X)=\mathcal{N}(y|w^TX, \sigma^2I)$$
and the likelihood function (by independence) is
\begin{align} L(y^{(1)},\dots,y^{(N)};w, X^{(1)},\dots,X^{(N)}) &= \prod_{i=1}^N \mathcal{N}(y^{(i)}\mid w^TX^{(i)}, \sigma^2I) \\ &= \frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N\bigl(y^{(i)}-w^TX^{(i)}\bigr)^2\right). \end{align}
Maximizing $L$ is therefore equivalent to minimizing (since everything else is constant; homoscedasticity) $$\sum_{i=1}^N\bigl(y^{(i)}-w^TX^{(i)}\bigr)^2.$$ That is exactly the least-squares criterion: the sum of squared differences between the fitted values $\hat{y}^{(i)}=w^TX^{(i)}$ and the observed values $y^{(i)}$.
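As a quick numerical sanity check (a sketch of my own, using simulated data and a fixed, assumed $\sigma$), the log-likelihood and $-\frac{1}{2\sigma^2}$ times the sum of squared errors differ only by a constant, no matter which $w$ we plug in:

```python
# Sketch with simulated data: the Gaussian log-likelihood equals a constant minus
# SSE(w)/(2*sigma^2), so maximizing one is the same as minimizing the other.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # 50 observations, 2 features (made up)
w_true = np.array([1.5, -2.0])
sigma = 0.7                                       # assumed known noise level
y = X @ w_true + rng.normal(scale=sigma, size=50)

def log_likelihood(w):
    resid = y - X @ w
    n = len(y)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

def sse(w):
    resid = y - X @ w
    return resid @ resid

for w in [np.zeros(2), np.array([1.0, -1.0]), w_true]:
    # The gap between the log-likelihood and -SSE/(2*sigma^2) is the same for every w.
    print(log_likelihood(w) + sse(w) / (2 * sigma**2))
```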
Why can't we use MLE for predicting $y$ values in linear regression and vice versa?
As explained above, we are in fact (more precisely, equivalently) using the MLE when we predict $y$ values. If the response variable has some other distribution instead of the normal, such as the Bernoulli distribution or any member of the exponential family, we map the linear predictor to the mean of the response distribution using a link function (chosen according to the response distribution), and the likelihood function becomes the product of the corresponding outcome probabilities (values between 0 and 1) after the transformation. In linear regression the link function is just the identity, since the linear predictor models the mean of the response directly.
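For illustration, here is a minimal sketch of that recipe with a non-normal response (my own example, with simulated data): a Bernoulli outcome with a logit link, fitted by maximizing the likelihood numerically.

```python
# Sketch: Bernoulli response with a logit link, fitted by maximum likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -1.5])
p = 1 / (1 + np.exp(-(X @ w_true)))   # inverse logit maps the linear predictor into (0, 1)
y = rng.binomial(1, p)                # Bernoulli outcomes

def neg_log_likelihood(w):
    z = X @ w
    # Bernoulli log-likelihood: sum_i [y_i * z_i - log(1 + exp(z_i))], negated and
    # written with logaddexp for numerical stability.
    return np.sum(np.logaddexp(0, z) - y * z)

w_hat = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(w_hat)   # should land close to w_true
```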
They are talking about the same thing. They simply used different notations and one is a particular case of the other one.
I'll start with The Elements of Statistical Learning which is the general case. We have:
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
Here $\hat{\beta}$ is a vector of the form $(\hat{\beta_1},\hat{\beta_2},\dots,\hat{\beta_p})$ and is the vector of fitted coefficients for a linear regression with $p$ variables, including the intercept. We also have $X$, the design matrix whose columns are the predictor variables (including the intercept column), and $y$, the vector of the dependent (response) variable. These equations are well known and are sometimes called the normal equations.
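As a quick numerical illustration (a sketch with made-up data), the normal-equation formula agrees with numpy's least-squares solver:

```python
# Sketch: beta-hat from the normal equations matches numpy's least-squares routine.
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
y = X @ np.array([2.0, 0.5, -1.0]) + rng.normal(scale=0.3, size=n)

beta_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y        # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)       # numpy's least-squares solver
print(np.allclose(beta_normal_eq, beta_lstsq))           # True
```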
Let's move to the ISL book. The exposition there discusses a particular case of multiple linear regression. Specifically, it describes the linear regression with a single predictor variable and an intercept. That means in our case the design matrix $X$ has two columns: the intercept (all ones) and the single predictor variable $x$. So, $X = \begin{bmatrix}1 &x\end{bmatrix}$. Also, $\hat{\beta}$ is the vector of the two fitted model parameters, so $\hat{\beta}=\begin{bmatrix}\hat{\beta_0} & \hat{\beta_1}\end{bmatrix}^T$, or in your notation $\begin{bmatrix}\hat{B_0} & \hat{B_1}\end{bmatrix}^T$. I will use $\beta$ instead of $B$, since I am more comfortable with it.
A preliminary calculation shows that:
$$\begin{bmatrix}1 &x\end{bmatrix}^T \begin{bmatrix}1 &x\end{bmatrix} = \begin{bmatrix}n & \sum x \\ \sum x & x^Tx\end{bmatrix} = n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix}$$
Here we used the fact that:
$$\begin{bmatrix}1 &x\end{bmatrix}^T \begin{bmatrix}1 &x\end{bmatrix}=\begin{bmatrix}1 & 1 &\dots & 1 \\ x_1 & x_2 & \dots & x_n\end{bmatrix}\begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\1 & x_n\end{bmatrix} = \begin{bmatrix}n & \sum x\\\sum x & x^Tx\end{bmatrix}= n\begin{bmatrix}1 & \bar{x} \\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix}$$
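A quick numerical check of this identity (with arbitrary numbers of my own choosing):

```python
# Sketch: X^T X for a design matrix [1  x] has the 2x2 structure claimed above.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
n = len(x)
X = np.column_stack([np.ones(n), x])   # intercept column plus the single predictor

lhs = X.T @ X
rhs = n * np.array([[1.0, x.mean()],
                    [x.mean(), (x @ x) / n]])
print(np.allclose(lhs, rhs))   # True
```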
Considering that, we now have the normal equations for your particular case as
$$\begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} = (n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix})^{-1} \begin{bmatrix}1 & x\end{bmatrix}^T y$$
Notice that it is not easy to invert this matrix symbolically, so we will multiply both sides on the left by that matrix to get rid of the inverse. Thus we obtain:
$$n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix} \begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} = \begin{bmatrix}1 & x\end{bmatrix}^T y$$
Dividing both sides by $n$, we have
$$\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix} \begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} = \begin{bmatrix}\bar{y} \\ \frac{x^Ty}{n}\end{bmatrix}$$
What we have obtained is a system of two equations in $\hat{\beta_0}$ and $\hat{\beta_1}$. The first equation is the one you already have:
$$\hat{\beta_0}+\bar{x}\hat{\beta_1} = \bar{y}, \qquad\text{i.e.}\qquad \hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}$$
The second equation, after substituting the first one into it, reduces to the slope formula you saw in the book, as worked out below.
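Explicitly, the second equation reads
$$\bar{x}\hat{\beta_0} + \frac{x^Tx}{n}\hat{\beta_1} = \frac{x^Ty}{n}.$$
Substituting $\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}$ from the first equation and multiplying through by $n$ gives
$$n\bar{x}\bar{y} - n\bar{x}^2\hat{\beta_1} + x^Tx\,\hat{\beta_1} = x^Ty
\quad\Longrightarrow\quad
\hat{\beta_1} = \frac{x^Ty - n\bar{x}\bar{y}}{x^Tx - n\bar{x}^2} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2},$$
which is the familiar slope formula for simple linear regression.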
In conclusion, ISL talks about a particular case in which the beta coefficients are scalars, while the other description covers the general case, where $\beta$ is a vector of coefficients. Hope that helped.
Best Answer
The linear regression model
$Y = X\beta + \epsilon$, where $\epsilon \sim N(0,I\sigma^2)$
$Y \in \mathbb{R}^{n}$, $X \in \mathbb{R}^{n \times p}$ and $\beta \in \mathbb{R}^{p}$
Note that our model error (residual) is ${\bf \epsilon = Y - X\beta}$. Our goal is to find the vector $\beta$ that minimizes the squared $L_2$ norm of this error.
Least Squares
Given data $(x_1,y_1),...,(x_n,y_n)$ where each $x_{i}$ is $p$ dimensional, we seek to find:
$$\widehat{\beta}_{LS} = {\underset \beta {\text{argmin}}} ||{\bf \epsilon}||^2 = {\underset \beta {\text{argmin}}} ||{\bf Y - X\beta}||^2 = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} ( y_i - x_{i}\beta)^2 $$
Maximum Likelihood
Using the model above, we can set up the likelihood of the data given the parameters $\beta$ as:
$$L(Y|X,\beta) = \prod_{i=1}^{n} f(y_i|x_i,\beta) $$
where $f(y_i|x_i,\beta)$ is the pdf of a normal distribution with mean $x_i\beta$ and variance $\sigma^2$ (equivalently, the error $y_i - x_i\beta$ has mean 0). Plugging it in:
$$L(Y|X,\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - x_i\beta)^2}{2\sigma^2}}$$
Now, generally, when dealing with likelihoods it's mathematically easier to take the log before continuing (products become sums, exponentials go away), so let's do that.
$$\log L(Y|X,\beta) = \sum_{i=1}^{n} \left[\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) -\frac{(y_i - x_i\beta)^2}{2\sigma^2}\right]$$
Since we want the maximum likelihood estimate, we want to find the maximum of the expression above with respect to $\beta$. The first term doesn't depend on $\beta$, so we can ignore it:
$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmax}}} \sum_{i=1}^{n} -\frac{(y_i - x_i\beta)^2}{2\sigma^2}$$
Note that the denominator is a constant with respect to $\beta$. Finally, notice the negative sign in front of the sum: maximizing the negative of a quantity is the same as minimizing the quantity itself. In other words:
$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} (y_i - x_i\beta)^2 = \widehat{\beta}_{LS}$$
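As a quick numerical confirmation (a sketch with simulated data, not from any book), maximizing the Gaussian log-likelihood numerically and solving least squares in closed form return the same $\widehat{\beta}$:

```python
# Sketch: the MLE found by a numerical optimizer matches the closed-form least-squares fit.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# Least squares: closed form, solving (X^T X) beta = X^T y
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum likelihood: minimize the negative Gaussian log-likelihood in beta
# (sigma^2 only rescales the objective, so any fixed positive value works)
def neg_log_lik(beta, sigma2=1.0):
    resid = y - X @ beta
    return n / 2 * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(p)).x
print(np.allclose(beta_ls, beta_mle, atol=1e-4))   # True
```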
Recall that for this to work, we had to make certain model assumptions (normality of the error terms, zero mean, constant variance). Under those assumptions, least squares is equivalent to MLE.
For completeness, note that the solution can be written as:
$${\bf \widehat{\beta} = (X^TX)^{-1}X^Ty} $$