Solved – Linear regression: *Why* can you partition sums of squares

Tags: orthogonal, regression, sums-of-squares

This post refers to a bivariate linear regression model, $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$. I have always taken the partitioning of the total sum of squares (SSTO) into the sum of squares for error (SSE) and the sum of squares for the model (SSR) on faith, but once I started really thinking about it, I don't understand why it works…

The part I do understand:

$y_i$: An observed value of y

$\bar{y}$: The mean of all observed $y_i$s

$\hat{y}_i$: The fitted/predicted value of y for a given observation's x

$y_i - \hat{y}_i$: Residual/error (if squared and added up for all observations this is SSE)

$\hat{y}_i - \bar{y}$: How much the model fitted value differs from the mean (if squared and added up for all observations this is SSR)

$y_i - \bar{y}$: How much an observed value differs from the mean (if squared and added up for all observations, this is SSTO).

I can understand why, for a single observation, without squaring anything, $(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$. And I can understand why, if you want to add things up over all observations, you have to square them or they'll add up to 0.

The part I don't understand is why $\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$ (i.e. SSTO = SSR + SSE). It seems to me that if you have a situation where $A = B + C$, then $A^2 = B^2 + 2BC + C^2$, not $A^2 = B^2 + C^2$. Why isn't that the case here?

Best Answer

> It seems to me that if you have a situation where $A = B + C$, then $A^2 = B^2 + 2BC + C^2$, not $A^2 = B^2 + C^2$. Why isn't that the case here?

Conceptually, the idea is that $BC = 0$ because $B$ and $C$ are orthogonal (i.e. are perpendicular).


In the context of linear regression here, the residuals $\epsilon_i = y_i - \hat{y}_i$ are orthogonal to the demeaned forecast $\hat{y}_i - \bar{y}$. The forecast from linear regression creates an orthogonal decomposition of $\mathbf{y}$ in a similar sense as $(3,4) = (3,0) + (0,4)$ is an orthogonal decomposition.
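A quick numerical sketch of this orthogonality, using made-up simulated data (the data and variable names below are just for illustration): the individual cross terms $(\hat{y}_i - \bar{y})(y_i - \hat{y}_i)$ are generally nonzero, but they sum to (essentially) zero.

```python
import numpy as np

# Minimal sketch with simulated data: fit y on x by OLS and check that
# the residuals are orthogonal to the demeaned fitted values.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

b1, b0 = np.polyfit(x, y, deg=1)   # slope, intercept of the fitted line
y_hat = b0 + b1 * x
y_bar = y.mean()

B = y_hat - y_bar                  # demeaned forecast
C = y - y_hat                      # residual

print(B[:3] * C[:3])               # individual cross terms: not zero
print(np.dot(B, C))                # their sum: ~0 (orthogonality)
```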

Linear Algebra version:

Let:

$$\mathbf{z} = \begin{bmatrix} y_1 - \bar{y} \\ y_2 - \bar{y}\\ \vdots \\ y_n - \bar{y} \end{bmatrix} \quad \quad \mathbf{\hat{z}} = \begin{bmatrix} \hat{y}_1 - \bar{y} \\ \hat{y}_2 - \bar{y} \\ \vdots \\ \hat{y}_n - \bar{y} \end{bmatrix} \quad \quad \boldsymbol{\epsilon} = \begin{bmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{bmatrix} = \mathbf{z} - \hat{\mathbf{z}}$$

Linear regression (with a constant included) decomposes $\mathbf{z}$ into the sum of two vectors: a forecast $\hat{\mathbf{z}}$ and a residual $\boldsymbol{\epsilon}$

$$ \mathbf{z} = \hat{\mathbf{z}} + \boldsymbol{\epsilon} $$

Let $\langle \cdot,\cdot \rangle$ denote the dot product. (More generally, $\langle X,Y \rangle$ can be the inner product $E[XY]$.)

\begin{align*} \langle \mathbf{z} , \mathbf{z} \rangle &= \langle \hat{\mathbf{z}} + \boldsymbol{\epsilon}, \hat{\mathbf{z}} + \boldsymbol{\epsilon} \rangle \\ &= \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + 2 \langle \hat{\mathbf{z}},\boldsymbol{\epsilon} \rangle + \langle \boldsymbol{\epsilon},\boldsymbol{\epsilon} \rangle \\ &= \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + \langle \boldsymbol{\epsilon},\boldsymbol{\epsilon} \rangle \end{align*}

The last line follows from the fact that $\langle \hat{\mathbf{z}},\boldsymbol{\epsilon} \rangle = 0$ (i.e. that $\hat{\mathbf{z}}$ and $\boldsymbol{\epsilon} = \mathbf{z}- \hat{\mathbf{z}}$ are orthogonal). You can prove that $\hat{\mathbf{z}}$ and $\boldsymbol{\epsilon}$ are orthogonal from how ordinary least squares regression constructs $\hat{\mathbf{z}}$.

$\hat{\mathbf{z}}$ is the linear projection of $\mathbf{z}$ onto the subspace spanned by the constant and the regressors $\mathbf{x}_1$, $\mathbf{x}_2$, etc. The residual $\boldsymbol{\epsilon}$ is orthogonal to that entire subspace, hence $\hat{\mathbf{z}}$ (which lies in that subspace) is orthogonal to $\boldsymbol{\epsilon}$.
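Concretely, in the bivariate case with an intercept, the normal equations of OLS say the residuals sum to zero and are orthogonal to the regressor: $\sum_i \epsilon_i = 0$ and $\sum_i x_i \epsilon_i = 0$. Since $\hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x})$ (the fitted line passes through $(\bar{x}, \bar{y})$), a short computation gives the orthogonality directly:

$$ \langle \hat{\mathbf{z}}, \boldsymbol{\epsilon} \rangle = \sum_i (\hat{y}_i - \bar{y})\,\epsilon_i = \hat{\beta}_1 \sum_i (x_i - \bar{x})\,\epsilon_i = \hat{\beta}_1 \left( \sum_i x_i \epsilon_i - \bar{x} \sum_i \epsilon_i \right) = 0 $$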


Note that since $\langle \cdot,\cdot\rangle$ here is the dot product, $\langle \mathbf{z} , \mathbf{z} \rangle = \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + \langle \boldsymbol{\epsilon},\boldsymbol{\epsilon} \rangle$ is simply another way of writing $\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$ (i.e. SSTO = SSR + SSE).
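A quick numerical check of this identity in the vector notation above (a sketch; the data are simulated and the variable names are only illustrative):

```python
import numpy as np

# Sketch: verify <z, z> = <z_hat, z_hat> + <eps, eps> on simulated data,
# i.e. SSTO = SSR + SSE.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 - 0.5 * x + rng.normal(size=100)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

z = y - y.mean()           # demeaned observations
z_hat = y_hat - y.mean()   # demeaned forecast
eps = y - y_hat            # residuals

ssto = np.dot(z, z)
ssr = np.dot(z_hat, z_hat)
sse = np.dot(eps, eps)
print(ssto, ssr + sse)     # equal up to floating-point error
```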