Proof for Simple Linear Regression: What am I doing wrong?

linear regression, statistical-inference, statistics

I am trying to prove the well-known formula for simple linear regression $$SS_{TOTAL}=SS_{MODEL}+SS_{ERROR}$$
i.e.,
$$\sum_{i=1}^n (y_i - \bar{y})^2 =\sum_{i=1}^n (\hat{y}_i-\bar{y})^2+ \sum_{i=1}^n(\hat{y}_i - y_i)^2 $$
and I'm having more trouble than I care to admit. I go down the following road:
\begin{align*}
\sum_{i=1}^n (\hat{y}_i-\bar{y})^2+ \sum_{i=1}^n(\hat{y}_i - y_i)^2 &= \sum_{i=1}^{n}(\hat{y}_i^2-2\hat{y}_i\bar{y}+\bar{y}^2) +\sum_{i=1}^{n}(\hat{y}_i^2-2y_i\hat{y}_i+y_i^2)\\
&=\sum_{i=1}^n (\hat{y}_i^2)-2\bar{y}^2n + \bar{y}^2n + \sum_{i=1}^n(\hat{y}_i^2) - \sum_{i=1}^n(2y_i\hat{y}_i)+\sum_{i=1}^n(y_i^2)\\
&=\sum_{i=1}^n (y_i^2)-2\bar{y}^2n + \bar{y}^2n + \sum_{i=1}^n(\hat{y}_i^2) - \sum_{i=1}^n(2y_i\hat{y}_i)+\sum_{i=1}^n(\hat{y}_i^2)\\
&=\sum_{i=1}^n (y_i^2-2y_i\bar{y} + \bar{y}^2) + 2\sum_{i=1}^n(\hat{y}_i^2 - y_i\hat{y}_i)\\
&=SS_{TOTAL}+2\sum_{i=1}^n(\hat{y}_i^2 – y_i\hat{y}_i)
\end{align*}
but I can't see any reason why the rightmost term must be zero. Any help redirecting this ship would be greatly appreciated.

Best Answer

The linear regression line of $y$ on $x$ is of the form

$$\hat y=\bar y+a(x-\bar x)$$

where $a=rs_y/s_x$, with $r$ the correlation coefficient between $x$ and $y$, and $s_y$ and $s_x$ the standard deviations of $y$ and $x$, respectively.

So for the $i$th observation we have $$\hat y_i=\bar y+a(x_i-\bar x)\quad,\,i=1,2,\ldots,n$$

Summing over all observations, $$\sum_{i=1}^n\hat y_i=n\bar y\quad,\quad\text{ i.e. }\quad\overline{\hat y}=\overline y$$
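As a quick numerical sanity check of the two facts above (a sketch, not part of the proof), the snippet below assumes NumPy and uses simulated data; the seed, sample size, and coefficients are arbitrary choices of mine. It compares the slope $a=rs_y/s_x$ with the least-squares slope returned by `np.polyfit`, and checks that the fitted values have mean $\bar y$.

```python
import numpy as np

# Illustrative data only: seed, sample size and coefficients are arbitrary.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

# Slope from the formula a = r * s_y / s_x.
# np.std uses divisor n by default; the ratio s_y / s_x does not depend on the divisor.
r = np.corrcoef(x, y)[0, 1]
a = r * y.std() / x.std()

# Fitted values from the line y_hat = y_bar + a * (x - x_bar).
y_hat = y.mean() + a * (x - x.mean())

# Least-squares fit of a degree-1 polynomial, for comparison.
slope, intercept = np.polyfit(x, y, 1)

print("slope matches least squares:", np.isclose(a, slope))
print("mean of fitted values equals mean of y:", np.isclose(y_hat.mean(), y.mean()))
```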

Now we can do the following simple algebra:

\begin{align} y_i&=\hat y_i+(y_i-\hat y_i) \\\implies y_i-\bar y&=(\hat y_i-\overline{\hat y})+(y_i-\hat y_i) \end{align}

Squaring both sides and summing over $i$ gives $$\sum_{i=1}^n(y_i-\bar y)^2=\sum_{i=1}^n(\hat y_i-\overline{\hat y})^2+\sum_{i=1}^n(y_i-\hat y_i)^2+2\sum_{i=1}^n(\hat y_i-\overline{\hat y})(y_i-\hat y_i)$$

Now show that the product term vanishes:

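\begin{align}
\sum_{i=1}^n(\hat y_i-\overline{\hat y})(y_i-\hat y_i)&=\sum_{i=1}^n(a(x_i-\bar x))\left((y_i-\bar y)-a(x_i-\bar x)\right)
\\&=a\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)-a^2\sum_{i=1}^n(x_i-\bar x)^2
\\&=a\left(n\operatorname{Cov}(x,y)-na\operatorname{Var}(x)\right)
\\&=a(nrs_xs_y-nrs_xs_y)
\\&=0
\end{align}

Here $\operatorname{Cov}$ and $\operatorname{Var}$ are taken with divisor $n$, so that $\operatorname{Cov}(x,y)=rs_xs_y$ and $a\operatorname{Var}(x)=rs_xs_y$.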
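This also settles the leftover term in the question. Since $\hat y_i^2-y_i\hat y_i=-\hat y_i(y_i-\hat y_i)$, we have $$2\sum_{i=1}^n(\hat y_i^2-y_i\hat y_i)=-2\bar y\sum_{i=1}^n(y_i-\hat y_i)-2\sum_{i=1}^n(\hat y_i-\bar y)(y_i-\hat y_i)$$ The first sum is zero because $\sum\hat y_i=n\bar y=\sum y_i$, and the second is the product term shown to vanish above.

If you want to see this numerically, here is a minimal sketch assuming NumPy. The helper name `ss_decomposition` and the toy data are hypothetical, made up purely for illustration. It computes the three sums of squares, the product term, and the question's leftover term.

```python
import numpy as np

def ss_decomposition(x, y):
    """Hypothetical helper: sums of squares for the line y_hat = y_bar + a (x - x_bar)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Slope a = r * s_y / s_x, as in the answer.
    r = np.corrcoef(x, y)[0, 1]
    a = r * y.std() / x.std()
    y_hat = y.mean() + a * (x - x.mean())

    ss_total = np.sum((y - y.mean()) ** 2)
    ss_model = np.sum((y_hat - y.mean()) ** 2)
    ss_error = np.sum((y - y_hat) ** 2)

    # Product (cross) term from the expansion of the square.
    cross = np.sum((y_hat - y.mean()) * (y - y_hat))
    # Leftover term from the question's derivation.
    leftover = 2 * np.sum(y_hat ** 2 - y * y_hat)
    return ss_total, ss_model, ss_error, cross, leftover

# Toy data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_total, ss_model, ss_error, cross, leftover = ss_decomposition(x, y)

print("SS_TOTAL == SS_MODEL + SS_ERROR:", np.isclose(ss_total, ss_model + ss_error))
print("product term is ~0:", abs(cross) < 1e-9)
print("leftover term is ~0:", abs(leftover) < 1e-9)
```

Both terms are only zero up to floating-point rounding, hence the tolerance rather than an exact comparison.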
