I've rewritten this question because my phrasing and notation were confusing.
We're assuming OLS regression throughout this post.
Given the data $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{X}\in \mathbb{R}^{N \times M}$, and $\mathbf{Z} \in \mathbb{R}^{N \times L}$, consider the following three procedures for regressing the data:
- Regress $\mathbf{X}$ on $\mathbf{Z}$ to get $\mathbf{X} = \mathbf{Z}\mathbf{\Gamma} + \mathbf{E}$. Define $\tilde{\mathbf{X}} \equiv \mathbf{E}$. Regress $\mathbf{y}$ on $\tilde{\mathbf{X}}$ to get $\mathbf{y} = \tilde{\mathbf{X}} \hat{\boldsymbol{\beta}}_1 + \boldsymbol{\epsilon}_1$.
- Concatenate the columns of $\mathbf{X}$ and $\mathbf{Z}$ to get an $N \times (M+L)$ matrix. Call this matrix $[\mathbf{X}\mathbf{Z}]$. Regress $\mathbf{y}$ on $[\mathbf{X}\mathbf{Z}]$ to get $\mathbf{y} = [\mathbf{X}\mathbf{Z}]\hat{\boldsymbol{\beta}}_{2,\text{total}} + \boldsymbol{\epsilon}_2$. Since we did column concatenation, we can write this as $\mathbf{y} = \mathbf{X} \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} + \mathbf{Z} \hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} + \boldsymbol{\epsilon}_2$.
- With $\tilde{\mathbf{X}}$ defined above, concatenate the columns of $\tilde{\mathbf{X}}$ and $\mathbf{Z}$ to get $[\tilde{\mathbf{X}} \mathbf{Z}]$. Regress $\mathbf{y}$ on $[\tilde{\mathbf{X}} \mathbf{Z}]$ to get $\mathbf{y} = [\tilde{\mathbf{X}} \mathbf{Z}] \hat{\boldsymbol{\beta}}_{3, \text{total}} + \boldsymbol{\epsilon}_3$. We can write this as $\mathbf{y} = \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} + \mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} + \boldsymbol{\epsilon}_3$.
It turns out that:
- $\hat{\boldsymbol{\beta}}_1 = \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} = \hat{\boldsymbol{\beta}}_{3, \tilde{\mathbf{X}}}$
- $\boldsymbol{\epsilon}_3 = \boldsymbol{\epsilon}_2 = \boldsymbol{\epsilon}_1 - \mathbf{Z}(\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}$
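(Before the algebra: here's a minimal NumPy sketch that checks the coefficient equalities on random data. The dimensions, seed, and variable names are my own choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 100, 3, 2
X = rng.standard_normal((N, M))
Z = rng.standard_normal((N, L))
y = rng.standard_normal(N)

# Procedure 1: residualize X on Z, then regress y on X_tilde.
Gamma = np.linalg.lstsq(Z, X, rcond=None)[0]   # shape (L, M)
X_tilde = X - Z @ Gamma
beta_1 = np.linalg.lstsq(X_tilde, y, rcond=None)[0]

# Procedure 2: regress y on [X Z]; the first M entries are beta_{2,X}.
beta_2 = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)[0]

# Procedure 3: regress y on [X_tilde Z]; the first M entries are beta_{3,X_tilde}.
beta_3 = np.linalg.lstsq(np.hstack([X_tilde, Z]), y, rcond=None)[0]

print(np.allclose(beta_1, beta_2[:M]))  # True
print(np.allclose(beta_1, beta_3[:M]))  # True
```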
I found these equalities by calculating the coefficients from the OLS regression coefficient formula $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y}$ (assuming the relevant Gram matrices are invertible), and I'll show the steps at the end; it's just some linear algebra.
My question is: can we prove these equalities without taking pains to do the matrix algebra? In other words, although I know these equalities hold, I don't know why they hold. It might be worth noting that $\tilde{\mathbf{X}}$ is orthogonal to $\mathbf{Z}$ (i.e., $\mathbf{Z}^\intercal\tilde{\mathbf{X}} = 0$), but I'm not sure how to make a general argument around that.
----- Below is the calculation -----
Since the regressions are all OLS, we have:
$\mathbf{\Gamma} = (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}$,
$\hat{\boldsymbol{\beta}}_1 = (\tilde{\mathbf{X}}^\intercal \tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal\mathbf{y}$,
$\hat{\boldsymbol{\beta}}_{2,\text{total}} = ([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1}[\mathbf{X}\mathbf{Z}]^\intercal\mathbf{y}$,
$\hat{\boldsymbol{\beta}}_{3,\text{total}} = ([\tilde{\mathbf{X}}\mathbf{Z}]^\intercal[\tilde{\mathbf{X}}\mathbf{Z}])^{-1}[\tilde{\mathbf{X}}\mathbf{Z}]^\intercal\mathbf{y}$.
We may express $\hat{\boldsymbol{\beta}}_1$, $\hat{\boldsymbol{\beta}}_{2,\text{total}}$, and $\hat{\boldsymbol{\beta}}_{3,\text{total}}$ in terms of $\mathbf{X}$, $\mathbf{y}$, and $\mathbf{Z}$ for the sake of comparison.
First, let's calculate $\hat{\boldsymbol{\beta}}_1$.
Plugging $\mathbf{\Gamma}$ into the definition of $\tilde{\mathbf{X}}$, we get $\tilde{\mathbf{X}} \equiv \mathbf{E} = \mathbf{X} - \mathbf{Z}\mathbf{\Gamma} = \mathbf{X} - \mathbf{Z} (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}$.
To keep the equations compact, define $\mathbf{S} \equiv \mathbf{Z} (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal$, the orthogonal projection onto the column space of $\mathbf{Z}$. So, $\tilde{\mathbf{X}} = (\mathbf{I} - \mathbf{S})\mathbf{X}$.
Note the following properties of $\mathbf{S}$:
- $\mathbf{S}^\intercal = \mathbf{S}$
- $(\mathbf{I}-\mathbf{S})^\intercal = \mathbf{I}-\mathbf{S}$
- $\mathbf{S} \mathbf{S} = \mathbf{S}$
- $(\mathbf{I}-\mathbf{S})(\mathbf{I}-\mathbf{S}) = \mathbf{I}-\mathbf{S}$
- $\mathbf{S}(\mathbf{I}-\mathbf{S}) = (\mathbf{I}-\mathbf{S})\mathbf{S} = 0$
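These are the defining properties of an orthogonal projection: $\mathbf{S}$ projects onto the column space of $\mathbf{Z}$, and $\mathbf{I}-\mathbf{S}$ onto its orthogonal complement. A quick NumPy check (random $\mathbf{Z}$ of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 2))

# S projects onto col(Z); I - S projects onto its orthogonal complement.
S = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
I = np.eye(100)

print(np.allclose(S, S.T))           # symmetry
print(np.allclose(S @ S, S))         # idempotence
print(np.allclose(S @ (I - S), 0))   # complementary projections annihilate
```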
Now, plugging $\tilde{\mathbf{X}}$ into the equation for $\hat{\boldsymbol{\beta}}_1$, we get:
$$\begin{align}\hat{\boldsymbol{\beta}}_1 &= (\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})^\intercal (\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})^\intercal\mathbf{y} \\ &=(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y}\end{align}$$
To calculate $\hat{\boldsymbol{\beta}}_{2,\text{total}}$, we use the following equation:
$$\begin{align}\begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}^{-1}=\begin{bmatrix}\mathbf{P} & -\mathbf{P} \mathbf{B} \mathbf{D}^{-1}\\-\mathbf{D}^{-1}\mathbf{C}\mathbf{P} & \mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{B}\mathbf{D}^{-1}\end{bmatrix}\end{align}$$
where $\mathbf{P} = (\mathbf{A} - \mathbf{B} \mathbf{D}^{-1}\mathbf{C})^{-1}$ is the inverse of the Schur complement of $\mathbf{D}$, assuming $\mathbf{D}$ and that Schur complement are invertible.
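A hedged sanity check of this blockwise-inversion identity, on random blocks of my own choosing (invertible with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 2))
C = rng.standard_normal((2, 3))
D = rng.standard_normal((2, 2)) + 5 * np.eye(2)  # keep D comfortably invertible

Dinv = np.linalg.inv(D)
P = np.linalg.inv(A - B @ Dinv @ C)  # inverse of the Schur complement of D

# Assemble the claimed blockwise inverse and compare with a direct inverse.
blockwise = np.block([
    [P,             -P @ B @ Dinv],
    [-Dinv @ C @ P,  Dinv + Dinv @ C @ P @ B @ Dinv],
])
direct = np.linalg.inv(np.block([[A, B], [C, D]]))
print(np.allclose(blockwise, direct))  # True
```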
With this equation, we can expand $([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1}$ in the equation for $\hat{\boldsymbol{\beta}}_{2,\text{total}}$.
Since $([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1} = \begin{bmatrix}\mathbf{X}^\intercal\mathbf{X} & \mathbf{X}^\intercal\mathbf{Z} \\ \mathbf{Z}^\intercal \mathbf{X} & \mathbf{Z}^\intercal \mathbf{Z}\end{bmatrix}^{-1}$, let $\mathbf{A} = \mathbf{X}^\intercal\mathbf{X}$, $\mathbf{B} = \mathbf{X}^\intercal\mathbf{Z}$, $\mathbf{C} = \mathbf{Z}^\intercal \mathbf{X}$, and $\mathbf{D} = \mathbf{Z}^\intercal\mathbf{Z}$.
So, $\mathbf{P} = (\mathbf{X}^\intercal\mathbf{X}-\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X})^{-1} = (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}$.
As mentioned in the description of the procedure, we can write $\hat{\boldsymbol{\beta}}_{2,\text{total}}$ as the row concatenation of $\hat{\boldsymbol{\beta}}_{2,\mathbf{X}}$ and $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$. We calculate them separately here.
$$\begin{align}\hat{\boldsymbol{\beta}}_{2,\mathbf{X}} &= \mathbf{P}\mathbf{X}^\intercal\mathbf{y} - \mathbf{P}\mathbf{B}\mathbf{D}^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} - (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}\end{align}$$
$$\begin{align}\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} &= -\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{X}^\intercal\mathbf{y} + (\mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{B}\mathbf{D}^{-1})\mathbf{Z}^\intercal\mathbf{y}\\&=-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} + (\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y} + (\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\end{align}$$
To calculate $\hat{\boldsymbol{\beta}}_{3,\text{total}}$, we again calculate $\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}}$ and $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$ separately.
To do this, note that we only need to substitute $\tilde{\mathbf{X}}$ for every $\mathbf{X}$ in the expressions for $\hat{\boldsymbol{\beta}}_{2,\mathbf{X}}$ and $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$.
$$\begin{align}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} &= (\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} \\ &= (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})^\intercal(\mathbf{I}-\mathbf{S})(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} \\ &= (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}\end{align}$$
We won't carry out the substitution for $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$ explicitly; the reason will become clear when we compare the $\boldsymbol{\epsilon}$'s.
So far, we have demonstrated that $\hat{\boldsymbol{\beta}}_1 = \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} = \hat{\boldsymbol{\beta}}_{3, \tilde{\mathbf{X}}}=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}$. We can now demonstrate the relation between $\boldsymbol{\epsilon}_1$, $\boldsymbol{\epsilon}_2$, and $\boldsymbol{\epsilon}_3$.
We directly calculate $\boldsymbol{\epsilon}_1$ and $\boldsymbol{\epsilon}_2$ as follows.
$$\begin{align}\boldsymbol{\epsilon}_1 &= \mathbf{y} - \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_1 \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y}\end{align}$$
$$\begin{align}\boldsymbol{\epsilon}_2 &= \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} - \mathbf{Z} \hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} \\ &= \mathbf{y} -\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{Z}\left\{-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\right\} \\ &= \mathbf{y} -\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} + \mathbf{S}\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y} \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y} \end{align}$$
Indeed, $\boldsymbol{\epsilon}_2 = \boldsymbol{\epsilon}_1 -\mathbf{S}\mathbf{y}$.
To calculate $\boldsymbol{\epsilon}_3$, we first examine the term $\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$. Remember that we substitute $\tilde{\mathbf{X}}$ for $\mathbf{X}$ in $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$ to get $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$. So,
$$\begin{align}\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} &= -\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\tilde{\mathbf{X}}(\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\ &= -\mathbf{S}\tilde{\mathbf{X}}(\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+\mathbf{S}\mathbf{y}\end{align}$$
But $\mathbf{S}\tilde{\mathbf{X}} = \mathbf{S}(\mathbf{I}-\mathbf{S})\mathbf{X} = 0$ by the fifth property of $\mathbf{S}$. Therefore, $\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} = \mathbf{S}\mathbf{y}$, and
$$\begin{align}\boldsymbol{\epsilon}_3 &= \mathbf{y} - \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} - \mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y}\end{align}$$
Thus, $\boldsymbol{\epsilon}_3=\boldsymbol{\epsilon}_2=\boldsymbol{\epsilon}_1-\mathbf{S}\mathbf{y}$.
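A companion NumPy sketch for the residual identity, under the same random-data setup as before (names and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 100, 3, 2
X = rng.standard_normal((N, M))
Z = rng.standard_normal((N, L))
y = rng.standard_normal(N)

S = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
X_tilde = (np.eye(N) - S) @ X

# Residual vectors from the three procedures.
eps_1 = y - X_tilde @ np.linalg.lstsq(X_tilde, y, rcond=None)[0]
XZ = np.hstack([X, Z])
eps_2 = y - XZ @ np.linalg.lstsq(XZ, y, rcond=None)[0]
XtZ = np.hstack([X_tilde, Z])
eps_3 = y - XtZ @ np.linalg.lstsq(XtZ, y, rcond=None)[0]

print(np.allclose(eps_3, eps_2))          # True
print(np.allclose(eps_2, eps_1 - S @ y))  # True
```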
Best Answer
Below is a geometric viewpoint similar to an answer to a different question: "Intuition behind $(X^TX)^{-1}$ in closed form of w in Linear Regression".
The regression is a perpendicular projection of $Y$ onto the span of the column vectors of $X$ and $Z$. What you are basically doing is defining a different vector $\bar{X}$ such that the coordinates associated with the projection remain the same.
This alternative vector is drawn in red on the right side of the image.
The vector $\bar{X}$ is perpendicular to $Z$, and that is why all those $\beta$ coefficients turn out to be the same.
If $Z$ and $\bar{X}$ are perpendicular, then:
- The regressions $$Y \sim \beta_1 \bar{X} + \beta_2 Z$$ and $$Y \sim \beta_1^\prime \bar{X}$$ will be the same, in the sense that $\beta_1 = \beta_1^\prime$.
- The regressions $$Y \sim \beta_1 \bar{X} + \beta_2 Z$$ and $$Y \sim \beta_1^{\prime\prime} (\bar{X} + a Z) + \beta_2 Z$$, with $a$ some constant, will be the same, in the sense that $\beta_1 = \beta_1^{\prime\prime}$.
Note that we can write $X = \bar{X} + a Z$. The difference between $X$ and $\bar{X}$ is some multiple of $Z$.
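Both bullet points are easy to check numerically. A one-column sketch (the shift $a = 2.5$, the seed, and all names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
Z = rng.standard_normal((N, 1))
X = rng.standard_normal((N, 1))
y = rng.standard_normal(N)

# Remove the Z-component of X so that X_bar is perpendicular to Z,
# i.e., X = X_bar + a * Z.
a = np.linalg.lstsq(Z, X, rcond=None)[0].item()
X_bar = X - a * Z

b_joint = np.linalg.lstsq(np.hstack([X_bar, Z]), y, rcond=None)[0]
b_alone = np.linalg.lstsq(X_bar, y, rcond=None)[0]
b_shift = np.linalg.lstsq(np.hstack([X_bar + 2.5 * Z, Z]), y, rcond=None)[0]

print(np.allclose(b_joint[0], b_alone[0]))  # beta_1 = beta_1'
print(np.allclose(b_joint[0], b_shift[0]))  # beta_1 = beta_1''
```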