Decomposing total sum of squares

Tags: linear model, regression, sums-of-squares

Consider the general linear regression model:
$$y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_px_{ip} + \epsilon_i = \mathbf{x}_i^T \beta + \epsilon_i$$

where $\mathbf{x}_i = (1,x_{i1},x_{i2},\cdots,x_{ip})^T$, $\beta=(\beta_0,\beta_1,\cdots,\beta_p)^T$, and the $\epsilon_i$ are i.i.d. $N(0,\sigma^2)$.

I would like to see a complete proof of the following identity from first principles:
$$\sum_{i=1}^n(y_i - \bar{y})^2 = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n(y_i - \hat{y}_i)^2$$
where $\hat{y}_i = \mathbf{x}_i^T \hat{\beta}$ ($\hat{\beta}$ is the least squares estimator and $\bar{y}$ is the sample mean of the $y_i$).
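
(In matrix notation, writing $X$ for the $n \times (p+1)$ design matrix whose $i$-th row is $\mathbf{x}_i^T$, and assuming $X^TX$ is invertible, $\hat{\beta}$ solves the normal equations $X^TX\hat{\beta} = X^T\mathbf{y}$, equivalently $X^T(\mathbf{y} - X\hat{\beta}) = \mathbf{0}$, and $\hat{\mathbf{y}} = X\hat{\beta}$.)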

I know that the two terms on the right can be obtained by subtracting and adding $\hat{y}_i$ inside the square on the left-hand side. But this introduces a cross term:
$$\sum_{i=1}^n 2(\hat{y}_i - \bar{y})(y_i - \hat{y}_i)$$

Many texts claim that this is zero, but I have not seen a general proof of this statement. How can this be shown?
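
As a numerical sanity check (a minimal sketch with a random design; the sizes and coefficients below are made up purely for illustration), the decomposition and the vanishing cross term are easy to verify with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Design matrix: a column of ones (the intercept) plus p random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])   # made-up true coefficients
y = X @ beta + rng.normal(size=n)        # simulated responses

# Least squares fit and fitted values.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)           # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)       # regression sum of squares
sse = np.sum((y - y_hat) ** 2)           # residual sum of squares
cross = 2 * np.sum((y_hat - y_bar) * (y - y_hat))

print(sst, ssr + sse)                    # equal up to floating-point error
print(cross)                             # ~ 0
```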

Best Answer

Split it like so (dropping the factor of 2, which does not affect whether the sum vanishes):

$=\sum_{i=1}^n \hat{y}_i (y_i -\hat{y}_i)-\bar{y} \sum_{i=1}^n (y_i - \hat{y}_i) $

$=\sum_{i=1}^n \hat{y}_i e_i -\bar{y} \sum_{i=1}^n e_i $

(where $e_i$ is the $i$-th residual)

$=\sum_{i=1}^n \hat{y}_i e_i$ (the second sum vanishes because the residuals of a model with an intercept sum to zero; this is the first normal equation)

Can you do it from there?
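
(In case it helps, here is one way to finish, assuming the design matrix $X$ contains the column of ones, so that the normal equations $X^T(\mathbf{y} - X\hat{\beta}) = \mathbf{0}$ hold:
$$\sum_{i=1}^n \hat{y}_i e_i = \hat{\mathbf{y}}^T\mathbf{e} = (X\hat{\beta})^T(\mathbf{y} - X\hat{\beta}) = \hat{\beta}^T X^T(\mathbf{y} - X\hat{\beta}) = \hat{\beta}^T\mathbf{0} = 0.$$
The row of the normal equations corresponding to the column of ones is exactly $\sum_{i=1}^n e_i = 0$, which was used above.)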
