Solved – Show that $\hat{\beta}_0 = \bar{y}$ for OLS when the columns of $\mathbf{X}$ are centered

Tags: linear model, ridge regression

Let's assume the general linear model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$, where $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{X}$ is a $N \times (p+1)$ matrix (where $p+1 < N$) with all entries in $\mathbb{R}$, $\boldsymbol\beta \in \mathbb{R}^{p+1}$, and $\boldsymbol\epsilon$ is a $N$-dimensional vector of real-valued random variables with $\mathbb{E}[\boldsymbol\epsilon] = \mathbf{0}_{N \times 1}$.

In the development of ridge regression, An Introduction to Statistical Learning (p. 215) and The Elements of Statistical Learning (p. 64) state that $\beta_0$ is estimated by $\bar{y} = \dfrac{1}{N}\sum_{i=1}^{N}y_i$ after the columns of $\mathbf{X}$ have been centered, and that each component of $\mathbf{y}$ is then centered by subtracting $\bar{y}$ before the ridge regression is fit.

Under OLS estimation, $$\hat{\boldsymbol\beta}_{\mathbf{X}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$

It can be shown that the matrix
$$\tilde{\mathbf{X}} = \left(\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N}\right)\mathbf{X}$$
centers the columns of $\mathbf{X}$, where $\mathbf{1}_{N \times N}$ is the $N \times N$ matrix of all $1$s, and $\mathbf{I}_{N \times N}$ is the $N \times N$ identity matrix.
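To see why, note that every entry in column $j$ of $\dfrac{1}{N}\mathbf{1}_{N \times N}\mathbf{X}$ equals the mean of that column:
$$\left(\dfrac{1}{N}\mathbf{1}_{N \times N}\mathbf{X}\right)_{ij} = \dfrac{1}{N}\sum_{k=1}^{N}x_{kj} = \bar{x}_j\text{,}$$
so the $(i,j)$ entry of $\tilde{\mathbf{X}}$ is $x_{ij} - \bar{x}_j$.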

I am interested in showing that $\hat{\beta}_0$ – i.e., the first component of $\hat{\boldsymbol\beta}$ – is equal to $\bar{y}$ under these assumptions. I thought a previous question might help, but that question deals with the case where $\mathbf{X}$ is right-multiplied by the centering matrix, rather than left-multiplied.

Using the above,
$$\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} = \mathbf{X}^{T} \left(\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N}\right)\mathbf{X}$$

because the matrix $\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N}$ is symmetric and idempotent.
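To spell out the idempotence step: since $\mathbf{1}_{N \times N}\mathbf{1}_{N \times N} = N\,\mathbf{1}_{N \times N}$,
$$\left(\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N}\right)^{2} = \mathbf{I}_{N \times N}-\dfrac{2}{N}\mathbf{1}_{N \times N}+\dfrac{1}{N^2}\,N\,\mathbf{1}_{N \times N} = \mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N}\text{,}$$
and symmetry is immediate from the symmetry of $\mathbf{I}_{N \times N}$ and $\mathbf{1}_{N \times N}$.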

Let's suppose that $$\mathbf{X} = \begin{bmatrix}
\mathbf{1}_{N \times 1} & \mathbf{x}_1 & \cdots & \mathbf{x}_p
\end{bmatrix}$$
so that $$\mathbf{X}^{T} = \begin{bmatrix}
\mathbf{1}_{N \times 1}^{T} \\
\mathbf{x}_1^{T} \\
\vdots \\
\mathbf{x}_p^{T}
\end{bmatrix}\text{.}$$
We also have
$$\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N} = \begin{bmatrix}
1-\frac{1}{N} & -\frac{1}{N} & \cdots & -\frac{1}{N} \\
-\frac{1}{N} & 1-\frac{1}{N} & \ddots & \vdots \\
\vdots & \ddots & \ddots & -\frac{1}{N} \\
-\frac{1}{N} & \cdots & -\frac{1}{N} & 1-\frac{1}{N}
\end{bmatrix}\text{.}$$

As I started working through the multiplication and the inverse it requires, I hit a dead end. Any suggestions?

Best Answer

By assumption, your design matrix $X$ can be partitioned as $$X = \begin{bmatrix} 1 & \tilde{X} \end{bmatrix},$$ where $\tilde{X} \in \mathbb{R}^{N \times p}$ satisfies $1^T\tilde{X} = \mathbf{0}^T$ because its columns are centered. Partition $\beta = \begin{bmatrix}\beta_0 & \tilde{\beta}^T\end{bmatrix}^T$ accordingly. Since $1^T\tilde{X} = \mathbf{0}^T$ and $1^T1 = N$, the Gram matrix $X^TX$ is block diagonal, so it can be inverted block by block. A straightforward calculation then gives
\begin{align*}
\hat{\beta} = \begin{bmatrix}\hat{\beta}_0 \\ \hat{\tilde{\beta}}\end{bmatrix}
&= (X^TX)^{-1}X^Ty \\
&= \begin{bmatrix} 1^T1 & 1^T\tilde{X} \\ \tilde{X}^T1 & \tilde{X}^T\tilde{X} \end{bmatrix}^{-1} \begin{bmatrix}1^T \\ \tilde{X}^T\end{bmatrix} y \\
&= \begin{bmatrix}N^{-1} & \mathbf{0}^T \\ \mathbf{0} & (\tilde{X}^T\tilde{X})^{-1}\end{bmatrix} \begin{bmatrix}1^Ty \\ \tilde{X}^Ty\end{bmatrix} \\
&= \begin{bmatrix} \bar{y} \\ (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^Ty \end{bmatrix},
\end{align*}
so the first component of $\hat{\beta}$ is exactly $\bar{y}$.
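If it helps to see this numerically, here is a minimal NumPy sketch; the simulated data and variable names are just for illustration. It centers the predictor columns, fits OLS with an intercept column, and checks that the fitted intercept matches $\bar{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3

# Simulated predictors and response (arbitrary values, only for the check)
X_raw = rng.normal(size=(N, p))
y = 2.0 + X_raw @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=N)

# Center each predictor column (equivalent to left-multiplying by I - (1/N) * ones)
X_centered = X_raw - X_raw.mean(axis=0)

# Design matrix with an intercept column of ones in front
X = np.column_stack([np.ones(N), X_centered])

# OLS via least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The two printed values agree up to floating-point error
print(beta_hat[0], y.mean())
```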