Regression – Different OLS Regression Procedures Leading to the Same Coefficients

Tags: least squares, linear algebra, regression

I've rewritten this question, because my phrasing and notation were confusing.

We're assuming OLS regression throughout this post.

If we have the data $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{X}\in \mathbb{R}^{N \times M}$, and $\mathbf{Z} \in \mathbb{R}^{N \times L}$, then consider the following three different procedures to regress the data:

  1. Regress $\mathbf{X}$ on $\mathbf{Z}$ to get $\mathbf{X} = \mathbf{Z}\mathbf{\Gamma} + \mathbf{E}$. Define $\tilde{\mathbf{X}} \equiv \mathbf{E}$. Regress $\mathbf{y}$ on $\tilde{\mathbf{X}}$ to get $\mathbf{y} = \tilde{\mathbf{X}} \hat{\boldsymbol{\beta}}_1 + \boldsymbol{\epsilon}_1$.

  2. Concatenate the columns of $\mathbf{X}$ and $\mathbf{Z}$ to get an $N \times (M+L)$ matrix. Call this matrix $[\mathbf{X}\mathbf{Z}]$. Regress $\mathbf{y}$ on $[\mathbf{X}\mathbf{Z}]$ to get $\mathbf{y} = [\mathbf{X}\mathbf{Z}]\hat{\boldsymbol{\beta}}_{2,\text{total}} + \boldsymbol{\epsilon}_2$. Since we did column concatenation, we can write this as $\mathbf{y} = \mathbf{X} \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} + \mathbf{Z} \hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} + \boldsymbol{\epsilon}_2$.

  3. With $\tilde{\mathbf{X}}$ defined above, concatenate the columns of $\tilde{\mathbf{X}}$ and $\mathbf{Z}$ to get $[\tilde{\mathbf{X}} \mathbf{Z}]$. Regress $\mathbf{y}$ on $[\tilde{\mathbf{X}} \mathbf{Z}]$ to get $\mathbf{y} = [\tilde{\mathbf{X}} \mathbf{Z}] \hat{\boldsymbol{\beta}}_{3, \text{total}} + \boldsymbol{\epsilon}_3$. We can write this as $\mathbf{y} = \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} + \mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} + \boldsymbol{\epsilon}_3$.

It turns out that:

  1. $\hat{\boldsymbol{\beta}}_1 = \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} = \hat{\boldsymbol{\beta}}_{3, \tilde{\mathbf{X}}}$
  2. $\boldsymbol{\epsilon}_3 = \boldsymbol{\epsilon}_2 = \boldsymbol{\epsilon}_1 - \mathbf{Z}(\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}$

I found these equalities by calculating the coefficients from the OLS regression coefficient formula $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal\mathbf{y}$ (assuming that $\mathbf{X}^\intercal\mathbf{X}$ is invertible), and I'll show the steps at the end; it's just some linear algebra.

My question is: can we prove these equalities without taking pains to do the matrix algebra? In other words, although I know these equalities hold, I don't know why they hold. It might be worth noting that $\tilde{\mathbf{X}}$ is uncorrelated with $\mathbf{Z}$, but I'm not sure how to make a general argument around that.
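Before the algebra, here is a minimal numerical sketch (my addition, not part of the original question) that checks both claimed equalities on random data, using `numpy.linalg.lstsq` for each regression:

```python
# Verify the two claimed equalities on random data: run the three OLS
# procedures described above and compare coefficients and residuals.
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 50, 3, 2
X = rng.normal(size=(N, M))
Z = rng.normal(size=(N, L))
y = rng.normal(size=N)

# Procedure 1: residualize X on Z, then regress y on the residuals.
Gamma, *_ = np.linalg.lstsq(Z, X, rcond=None)
X_tilde = X - Z @ Gamma
beta_1, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
eps_1 = y - X_tilde @ beta_1

# Procedure 2: regress y on the column concatenation [X Z].
beta_2, *_ = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)
eps_2 = y - np.hstack([X, Z]) @ beta_2

# Procedure 3: regress y on [X_tilde Z].
beta_3, *_ = np.linalg.lstsq(np.hstack([X_tilde, Z]), y, rcond=None)
eps_3 = y - np.hstack([X_tilde, Z]) @ beta_3

S = Z @ np.linalg.inv(Z.T @ Z) @ Z.T  # projection onto the columns of Z

assert np.allclose(beta_1, beta_2[:M])       # equality 1, part one
assert np.allclose(beta_1, beta_3[:M])       # equality 1, part two
assert np.allclose(eps_2, eps_3)             # equality 2, part one
assert np.allclose(eps_2, eps_1 - S @ y)     # equality 2, part two
print("all equalities verified")
```

(This is the Frisch–Waugh–Lovell theorem in action; the check passes for any generic draw of the data.)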

---------- Below is the calculation ----------

Since the regressions are all OLS, we have:

$\mathbf{\Gamma} = (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}$,

$\hat{\boldsymbol{\beta}}_1 = (\tilde{\mathbf{X}}^\intercal \tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal\mathbf{y}$

$\hat{\boldsymbol{\beta}}_{2,\text{total}} = ([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1}[\mathbf{X}\mathbf{Z}]^\intercal\mathbf{y}$,

$\hat{\boldsymbol{\beta}}_{3,\text{total}} = ([\tilde{\mathbf{X}}\mathbf{Z}]^\intercal[\tilde{\mathbf{X}}\mathbf{Z}])^{-1}[\tilde{\mathbf{X}}\mathbf{Z}]^\intercal\mathbf{y}$.

We may express $\hat{\boldsymbol{\beta}}_1$, $\hat{\boldsymbol{\beta}}_{2,\text{total}}$, and $\hat{\boldsymbol{\beta}}_{3,\text{total}}$ in terms of $\mathbf{X}$, $\mathbf{y}$, and $\mathbf{Z}$ for the sake of comparison.

First, let's calculate $\hat{\boldsymbol{\beta}}_1$.

Plugging $\mathbf{\Gamma}$ into the definition of $\tilde{\mathbf{X}}$, we get $\tilde{\mathbf{X}} \equiv \mathbf{E} = \mathbf{X} - \mathbf{Z}\mathbf{\Gamma} = \mathbf{X} - \mathbf{Z} (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}$.

To keep the equations compact, define $\mathbf{S} \equiv \mathbf{Z} (\mathbf{Z}^\intercal \mathbf{Z})^{-1}\mathbf{Z}^\intercal$. So, $\tilde{\mathbf{X}} = (\mathbf{I} - \mathbf{S})\mathbf{X}$.

Note the following properties of $\mathbf{S}$:

  1. $\mathbf{S}^\intercal = \mathbf{S}$

  2. $(\mathbf{I}-\mathbf{S})^\intercal = \mathbf{I}-\mathbf{S}$

  3. $\mathbf{S} \mathbf{S} = \mathbf{S}$

  4. $(\mathbf{I}-\mathbf{S})(\mathbf{I}-\mathbf{S}) = \mathbf{I}-\mathbf{S}$

  5. $\mathbf{S}(\mathbf{I}-\mathbf{S}) = (\mathbf{I}-\mathbf{S})\mathbf{S} = 0$
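All five properties follow from $\mathbf{S}$ being an orthogonal projection; a quick numerical sanity check (my addition) on a random $\mathbf{Z}$:

```python
# Check the five listed properties of the projection matrix
# S = Z (Z^T Z)^{-1} Z^T on a random Z.
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 4))
S = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
I = np.eye(20)

assert np.allclose(S.T, S)                    # 1. S is symmetric
assert np.allclose((I - S).T, I - S)          # 2. I - S is symmetric
assert np.allclose(S @ S, S)                  # 3. S is idempotent
assert np.allclose((I - S) @ (I - S), I - S)  # 4. I - S is idempotent
assert np.allclose(S @ (I - S), 0)            # 5. S(I - S) = (I - S)S = 0
print("S is an orthogonal projection")
```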

Now, plugging $\tilde{\mathbf{X}}$ into the equation for $\hat{\boldsymbol{\beta}}_1$, we get:

$$\begin{align}\hat{\boldsymbol{\beta}}_1 &= (\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})^\intercal (\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})^\intercal\mathbf{y} \\ &=(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y}\end{align}$$

To calculate $\hat{\boldsymbol{\beta}}_{2,\text{total}}$, we use the following equation:

$\begin{align}\begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}^{-1}=\begin{bmatrix}\mathbf{P} & -\mathbf{P} \mathbf{B} \mathbf{D}^{-1}\\-\mathbf{D}^{-1}\mathbf{C}\mathbf{P} & \mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{B}\mathbf{D}^{-1}\end{bmatrix}\end{align}$

where $\mathbf{P} = (\mathbf{A} – \mathbf{B} \mathbf{D}^{-1}\mathbf{C})^{-1}$, assuming $\mathbf{D}$ invertible.
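As a sanity check on the quoted block-inverse (Schur complement) formula, here is a numerical sketch (my addition), with $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{D}$ as in the text:

```python
# Verify the block-inverse formula against a direct inverse
# of the full matrix [[A, B], [C, D]].
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3)); B = rng.normal(size=(3, 2))
C = rng.normal(size=(2, 3)); D = rng.normal(size=(2, 2)) + 5 * np.eye(2)

Dinv = np.linalg.inv(D)
P = np.linalg.inv(A - B @ Dinv @ C)  # inverse of the Schur complement of D
top = np.hstack([P, -P @ B @ Dinv])
bot = np.hstack([-Dinv @ C @ P, Dinv + Dinv @ C @ P @ B @ Dinv])
block_inv = np.vstack([top, bot])

full = np.vstack([np.hstack([A, B]), np.hstack([C, D])])
assert np.allclose(block_inv, np.linalg.inv(full))
print("block-inverse formula checked")
```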

With this equation, we can expand $([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1}$ in the equation for $\hat{\boldsymbol{\beta}}_{2,\text{total}}$.

Since $([\mathbf{X}\mathbf{Z}]^\intercal[\mathbf{X}\mathbf{Z}])^{-1} = \begin{bmatrix}\mathbf{X}^\intercal\mathbf{X} & \mathbf{X}^\intercal\mathbf{Z} \\ \mathbf{Z}^\intercal \mathbf{X} & \mathbf{Z}^\intercal \mathbf{Z}\end{bmatrix}^{-1}$, let $\mathbf{A} = \mathbf{X}^\intercal\mathbf{X}$, $\mathbf{B} = \mathbf{X}^\intercal\mathbf{Z}$, $\mathbf{C} = \mathbf{Z}^\intercal \mathbf{X}$, and $\mathbf{D} = \mathbf{Z}^\intercal\mathbf{Z}$.

So, $\mathbf{P} = (\mathbf{X}^\intercal\mathbf{X}-\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X})^{-1} = (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}$.

As mentioned in the description of the procedure, we can write $\hat{\boldsymbol{\beta}}_{2,\text{total}}$ as the row concatenation of $\hat{\boldsymbol{\beta}}_{2,\mathbf{X}}$ and $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$. We calculate them separately here.

$$\begin{align}\hat{\boldsymbol{\beta}}_{2,\mathbf{X}} &= \mathbf{P}\mathbf{X}^\intercal\mathbf{y} – \mathbf{P}\mathbf{B}\mathbf{D}^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} – (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}\end{align}$$

$$\begin{align}\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} &= -\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{X}^\intercal\mathbf{y} + (\mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}\mathbf{P}\mathbf{B}\mathbf{D}^{-1})\mathbf{Z}^\intercal\mathbf{y}\\&=-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} + (\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y} + (\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\&=-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\end{align}$$

To calculate $\hat{\boldsymbol{\beta}}_{3,\text{total}}$, we again calculate $\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}}$ and $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$ separately.

To do this, note that we only need to substitute every $\mathbf{X}$ in the formulas for $\hat{\boldsymbol{\beta}}_{2,\mathbf{X}}$ and $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$ with $\tilde{\mathbf{X}}$.

$$\begin{align}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} &= (\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} \\ &= (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})^\intercal(\mathbf{I}-\mathbf{S})(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} \\ &= (\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}\end{align}$$

We're not going to explicitly do the substitution for $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$, and the reason will be clear as we compare the $\boldsymbol{\epsilon}$'s.

So far, we have demonstrated that $\hat{\boldsymbol{\beta}}_1 = \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} = \hat{\boldsymbol{\beta}}_{3, \tilde{\mathbf{X}}}=(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}$. We can now demonstrate the relation between $\boldsymbol{\epsilon}_1$, $\boldsymbol{\epsilon}_2$, and $\boldsymbol{\epsilon}_3$.
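The common closed form can also be checked numerically; below is a small sketch (my addition) comparing $(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}$ against an `lstsq` fit of $\mathbf{y}$ on $\tilde{\mathbf{X}}$:

```python
# Check that the closed form (X^T (I-S) X)^{-1} X^T (I-S) y
# matches the least-squares fit of y on X_tilde = (I - S) X.
import numpy as np

rng = np.random.default_rng(4)
N, M, L = 40, 3, 2
X = rng.normal(size=(N, M))
Z = rng.normal(size=(N, L))
y = rng.normal(size=N)

I = np.eye(N)
S = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
X_tilde = (I - S) @ X

closed_form = np.linalg.solve(X.T @ (I - S) @ X, X.T @ (I - S) @ y)
beta_1, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
assert np.allclose(closed_form, beta_1)
print("closed form matches")
```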

We directly calculate $\boldsymbol{\epsilon}_1$ and $\boldsymbol{\epsilon}_2$ as follows.

$$\begin{align}\boldsymbol{\epsilon}_1 &= \mathbf{y} - \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_1 \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y}\end{align}$$

$$\begin{align}\boldsymbol{\epsilon}_2 &= \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}_{2,\mathbf{X}} - \mathbf{Z} \hat{\boldsymbol{\beta}}_{2,\mathbf{Z}} \\ &= \mathbf{y} -\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{Z}\left\{-(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\right\} \\ &= \mathbf{y} -\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} + \mathbf{S}\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y} \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I} - \mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y} \end{align}$$

Indeed, $\boldsymbol{\epsilon}_2 = \boldsymbol{\epsilon}_1 -\mathbf{S}\mathbf{y}$.

To calculate $\boldsymbol{\epsilon}_3$, we first examine the term $\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$. Remember that we substitute $\mathbf{X}$ in $\hat{\boldsymbol{\beta}}_{2,\mathbf{Z}}$ with $\tilde{\mathbf{X}}$ to get $\hat{\boldsymbol{\beta}}_{3,\mathbf{Z}}$. So,

$\begin{align}\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} &= -\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\tilde{\mathbf{X}}(\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z})^{-1}\mathbf{Z}^\intercal\mathbf{y}\\ &= -\mathbf{S}\tilde{\mathbf{X}}(\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y}+\mathbf{S}\mathbf{y}\end{align}$

But $\mathbf{S}\tilde{\mathbf{X}} = \mathbf{S}(\mathbf{I}-\mathbf{S})\mathbf{X} = 0$ by the fifth property of $\mathbf{S}$. Therefore, $\mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} = \mathbf{S}\mathbf{y}$, and

$$\begin{align}\boldsymbol{\epsilon}_3 &= \mathbf{y} - \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_{3,\tilde{\mathbf{X}}} - \mathbf{Z} \hat{\boldsymbol{\beta}}_{3,\mathbf{Z}} \\ &= \mathbf{y} - (\mathbf{I}-\mathbf{S})\mathbf{X}(\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{I}-\mathbf{S})\mathbf{y} - \mathbf{S}\mathbf{y}\end{align}$$

Thus, $\boldsymbol{\epsilon}_3=\boldsymbol{\epsilon}_2=\boldsymbol{\epsilon}_1-\mathbf{S}\mathbf{y}$.

Best Answer

Below is a geometric viewpoint, similar to an answer to a different question: *Intuition behind $(X^TX)^{-1}$ in the closed form of $w$ in Linear Regression*.

[image from the linked question, showing the projection geometry]

The regression is a perpendicular projection onto the space spanned by the columns of $X$ and $Z$. What you are basically doing is defining a different vector $\bar{X}$ such that the coordinates associated with the projection remain the same.

This alternative vector is drawn in red in the image, on the right side.

The vector $\bar{X}$ is perpendicular to $Z$, and that is why all those coefficients $\beta$ turn out to be the same.

If $Z$ and $\bar{X}$ are perpendicular, then:

  • The regression $$Y \sim \beta_1 \bar{X} + \beta_2 Z$$ and $$Y \sim \beta_1^\prime \bar{X} $$ will be the same in the sense $\beta_1 = \beta_1^\prime$.

  • The regression $$Y \sim \beta_1 \bar{X} + \beta_2 Z$$ and $$Y \sim \beta_1^{\prime\prime} (\bar{X} + a Z) + \beta_2 Z$$ with $a$ some constant, will be the same in the sense $\beta_1 = \beta_1^{\prime\prime}$.

    Note that we can write $X = \bar{X} + a Z$. The difference between $X$ and $\bar{X}$ is some multiple of $Z$.
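Both bullets can be illustrated numerically; here is a small sketch (my addition, not from the answer) with single-column $\bar{X}$ and $Z$ and an arbitrary constant $a$:

```python
# Illustrate the two bullets: when Xbar is perpendicular to Z,
# (1) dropping Z leaves Xbar's coefficient unchanged, and
# (2) adding a*Z to the regressor leaves its coefficient unchanged,
#     because Z absorbs the difference.
import numpy as np

rng = np.random.default_rng(3)
N = 30
Z = rng.normal(size=(N, 1))
X = rng.normal(size=(N, 1))
# Xbar: the component of X perpendicular to Z
Xbar = X - Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
y = rng.normal(size=N)

b_joint, *_ = np.linalg.lstsq(np.hstack([Xbar, Z]), y, rcond=None)
b_alone, *_ = np.linalg.lstsq(Xbar, y, rcond=None)
a = 2.7  # any constant
b_shift, *_ = np.linalg.lstsq(np.hstack([Xbar + a * Z, Z]), y, rcond=None)

assert np.allclose(b_joint[0], b_alone[0])  # first bullet
assert np.allclose(b_joint[0], b_shift[0])  # second bullet
print("both bullets verified")
```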
