Solved – least squares coefficient estimates in calculus and matrix calculus

least-squares, regression, self-study

I am a beginner in statistics, and recently I've been reading An Introduction to Statistical Learning with Applications in R by Hastie and Tibshirani.

On linear regression, it says:

$$RSS=(y_1 -\hat{B_0} - \hat{B_1}x_1)^2 + (y_2 -\hat{B_0} - \hat{B_1}x_2)^2+\dots+(y_n -\hat{B_0} - \hat{B_1}x_n)^2.$$

The least squares approach chooses $\hat{B_0}$ and $\hat{B_1}$ to minimize the RSS. Using some calculus, one can show that the minimizers are:

$$\hat{B_1} =\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2},$$

$$\hat{B_0} = \bar{y}-\hat{B_1}\bar{x}.$$

In a different book, The Elements of Statistical Learning by the same authors, $\hat{\beta}$ is defined as:

$$\hat{\beta} = (X^T X)^{-1} X^Ty.$$

Is $\hat{\beta}$ the same as $\hat{B_1}$?
I assumed they were the same until I realized that they don't seem to produce the same result.

$X^T y$ seems to match $\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$, and $(X^T X)^{-1}$ seems to match $\sum_{i=1}^n (x_i-\bar{x})^2$, but I don't know why the algebraic solution subtracts the mean while the matrix computation doesn't.

Best Answer

They are talking about the same thing. The two books simply use different notation, and one result is a particular case of the other.

I'll start with The Elements of Statistical Learning, which gives the general case. We have:

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$

Here $\hat{\beta}$ is the vector $(\hat{\beta_0},\hat{\beta_1},\dots,\hat{\beta_p})$ of fitted coefficients for a linear regression with $p$ predictors plus an intercept. $X$ is the design matrix, with one row per observation and one column per predictor (the first column being all ones, for the intercept), and $y$ is the vector of the dependent (response) variable. The system $X^TX\hat{\beta} = X^Ty$ that this formula solves is well known and is usually called the normal equations.
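If it helps to see the general formula in action, here is a minimal numerical sketch (assuming NumPy and entirely made-up data, not anything from either book):

```python
import numpy as np

# Made-up data: n = 6 observations of p = 2 predictors (purely illustrative).
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 6.0],
                  [6.0, 5.0]])
y = np.array([3.0, 2.5, 6.0, 5.5, 9.0, 8.5])

# Design matrix: prepend a column of ones for the intercept.
X = np.column_stack([np.ones(len(y)), X_raw])

# Solve the normal equations X^T X beta = X^T y
# (numerically preferable to forming the inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # [intercept, coefficient of x1, coefficient of x2]
```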

Let's move to the ISLR book. The exposition there discusses a particular case of this: a linear regression with a single predictor and an intercept. That means in our case the design matrix $X$ has two columns: the intercept (a column of all ones) and the single predictor $x$. So $X = \begin{bmatrix}1 &x\end{bmatrix}$. Also, $\hat{\beta}$ is the vector of the two fitted model parameters, so $\hat{\beta}=\begin{bmatrix}\hat{\beta_0} & \hat{\beta_1}\end{bmatrix}^T$, or in your notation $\begin{bmatrix}\hat{B_0} & \hat{B_1}\end{bmatrix}^T$. I will use $\beta$ instead of $B$, since I am more comfortable with it.

A preliminary calculation shows that:

$$\begin{bmatrix}1 &x\end{bmatrix}^T \begin{bmatrix}1 &x\end{bmatrix} = \begin{bmatrix}n & \sum x \\ \sum x & x^Tx\end{bmatrix} = n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix}$$

Here we used the fact that:

$$\begin{bmatrix}1 &x\end{bmatrix}^T \begin{bmatrix}1 &x\end{bmatrix}=\begin{bmatrix}1 & 1 &\dots & 1 \\ x_1 & x_2 & \dots & x_n\end{bmatrix}\begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\1 & x_n\end{bmatrix} = \begin{bmatrix}n & \sum x\\\sum x & x^Tx\end{bmatrix}= n\begin{bmatrix}1 & \bar{x} \\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix}$$
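If you want to sanity-check that block structure numerically, here is a quick sketch (again assuming NumPy and an arbitrary $x$):

```python
import numpy as np

# A made-up predictor, just to verify the identity numerically.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(x)

# Two-column design matrix [1, x].
X = np.column_stack([np.ones(n), x])

lhs = X.T @ X
rhs = n * np.array([[1.0, x.mean()],
                    [x.mean(), (x @ x) / n]])
print(np.allclose(lhs, rhs))  # True
```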

Using this, the normal equations for your particular case now read:

$$\begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} = \left(n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix}\right)^{-1} \begin{bmatrix}1 & x\end{bmatrix}^T y$$

Notice that it is not easy to write out this inverse symbolically, so instead we multiply both sides of the equation on the left by $X^TX$ to get rid of the inverse. We obtain:

$$n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix} \begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} = \begin{bmatrix}1 & x\end{bmatrix}^T y$$

Moving $n$ to the right-hand side, and noting that $\frac{1}{n}\begin{bmatrix}1 & x\end{bmatrix}^T y = \begin{bmatrix}\bar{y} \\ \frac{x^Ty}{n}\end{bmatrix}$, we have

$$\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix} \begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} = \begin{bmatrix}\bar{y} \\ \frac{x^Ty}{n}\end{bmatrix}$$

What we have now is a system of two equations in $\hat{\beta_0}$ and $\hat{\beta_1}$. The first equation is exactly the intercept formula you already have:

$$\hat{\beta_0}+\bar{x}\hat{\beta_1} = \bar{y} \quad\Longleftrightarrow\quad \hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}$$

The second equation, once you substitute the first into it, reduces to the slope formula you saw in the book.
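Sketching the algebra: the second row of the system reads $\bar{x}\hat{\beta_0} + \frac{x^Tx}{n}\hat{\beta_1} = \frac{x^Ty}{n}$. Substituting $\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}$ and rearranging gives

$$\hat{\beta_1}\left(\frac{x^Tx}{n}-\bar{x}^2\right) = \frac{x^Ty}{n}-\bar{x}\bar{y},$$

and since $x^Tx - n\bar{x}^2 = \sum_{i=1}^n (x_i-\bar{x})^2$ and $x^Ty - n\bar{x}\bar{y} = \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})$, multiplying both sides by $n$ yields exactly

$$\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}.$$

This is also why the means show up in the ISLR formula but not in the matrix formula: the subtraction of $\bar{x}$ and $\bar{y}$ appears only after the intercept has been eliminated from the system.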

To conclude: ISLR discusses a particular case, in which the beta coefficients are scalars, while ESL describes the general case, in which $\hat{\beta}$ is a vector of coefficients. Hope that helps.
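If you want to convince yourself numerically that the two routes agree, here is a small sketch (assuming NumPy and simulated data) that computes the coefficients both ways:

```python
import numpy as np

# Simulated data from a known line (the values are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.5 + 2.0 * x + rng.normal(scale=0.3, size=50)

# ESL route: normal equations with an explicit intercept column.
X = np.column_stack([np.ones_like(x), x])
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# ISLR route: closed-form estimates for simple linear regression.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(np.allclose(beta_matrix, [b0, b1]))  # True: same coefficients
```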