Solved – Total Sum of Squares, Covariance between residuals and the predicted values

Tags: covariance, residuals, variance

This is more of a follow-up question to: Confused with Residual Sum of Squares and Total Sum of Squares.

Total sum of squares can be represented as:

$$\displaystyle \sum_i ({y}_i-\hat{y}_i)^2+2\sum_i ({y}_i-\hat{y}_i)(\hat{y}_i-\bar{y}) +\sum_i(\hat{y}_i-\bar{y})^2$$

Where:

  1. The 1st term is the residual sum of squares.
  2. The 2nd term is the covariance between the residuals and the predicted values.
  3. The 3rd term is the explained sum of squares.

There are a few things I don't understand:

  1. Why would a correlation between residuals and predicted values mean there are better values of $\hat y$?
  2. Why is the second term a covariance? I've tried to work it out on paper, but I keep getting an extra division by $N$ (the number of data points).

$$2\sum_i ({y}_i-\hat{y}_i)(\hat{y}_i-\bar{y})=2\sum_i(y_i \hat y_i-\hat y_i^2 + \hat y_i \bar y - y_i \bar y)$$


\begin{align}
\operatorname{cov}(X, Y) & = E[XY]-E[X]E[Y] \\
\operatorname{cov}(y_i-\hat y_i, \hat y_i) & = E[(y_i -\hat y_i)\hat y_i]-E[y_i-\hat y_i]E[\hat y_i] \\
E[\hat y_i] & = \bar y \text{ if perfect prediction} \\
& =E[(y_i-\hat y_i)\hat y_i]-E[y_i-\hat y_i]\,\bar y \\
& =E[(y_i-\hat y_i)\hat y_i]-E[\bar y(y_i-\hat y_i)] \\
& =E[y_i\hat y_i-\hat y_i^2]-E[y_i \bar y-\hat y_i \bar y] \\
& =E[y_i\hat y_i-\hat y_i^2]+E[-(y_i \bar y-\hat y_i \bar y)] \\
& =E[y_i\hat y_i-\hat y_i^2]+E[-y_i \bar y+\hat y_i \bar y] \\
& =E[y_i\hat y_i-\hat y_i^2-y_i \bar y+\hat y_i \bar y] \\
& =\frac{\sum_i\big(y_i\hat y_i-\hat y_i^2-y_i \bar y+\hat y_i \bar y\big)}{N}
\end{align}
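Factoring the last line back into the original product form, this says

$$\operatorname{cov}\big(y_i-\hat y_i,\ \hat y_i\big)=\frac{1}{N}\sum_i (y_i-\hat y_i)(\hat y_i-\bar y).$$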

From the above computation, $\operatorname{cov}\big(y_i-\hat y_i,\ \hat y_i\big) \ne \displaystyle 2\sum_i ({y}_i-\hat{y}_i)(\hat{y}_i-\bar{y})$.

Am I misinterpreting something, or is my computation incorrect?


In response to: $H$ is a really important matrix and it's worth taking the time to understand it. First, note that it's symmetric (you can prove this by showing $H^T=H$). Then prove it's idempotent by showing $H^2=H$. This all means that $H$ is a projection matrix, and $H$ projects a vector $v \in \mathbb{R}^n$ into the $p$-dimensional subspace spanned by the columns of $X$. It turns out that $I−H$ is also a projection, and this projects a vector into the space orthogonal to the space that $H$ projects into.
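For reference, both claims follow in general from the definition $H = X(X^TX)^{-1}X^T$ and the symmetry of $X^TX$:

\begin{align}
H^T &= \big(X(X^TX)^{-1}X^T\big)^T = X\big((X^TX)^{-1}\big)^T X^T = X\big((X^TX)^T\big)^{-1}X^T = X(X^TX)^{-1}X^T = H \\
H^2 &= X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = X(X^TX)^{-1}X^T = H
\end{align}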

Let's assume $X$ is a 2×2 matrix:

$$
\begin{bmatrix}
1 & x_1 \\
1 & x_2 \\
\end{bmatrix}
$$

Then $X^T$:
$$
\begin{bmatrix}
1 & 1 \\
x_1 & x_2 \\
\end{bmatrix}
$$


Compute $X^TX$

$
\begin{bmatrix}
1 & 1 \\
x_1 & x_2 \\
\end{bmatrix}
$
$
\begin{bmatrix}
1 & x_1 \\
1 & x_2 \\
\end{bmatrix}
$
$=$
$
\begin{bmatrix}
2 & x_1+x_2 \\
x_1+x_2 & x_1^2+x_2^2 \\
\end{bmatrix}
$
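In general, with $n$ observations, an intercept column of $1$'s, and a single predictor, the same multiplication gives
$$
X^TX =
\begin{bmatrix}
n & \sum_i x_i \\
\sum_i x_i & \sum_i x_i^2 \\
\end{bmatrix},
$$
of which the matrix above is the $n=2$ case.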


Compute
$(X^TX)^{-1}$

$ A =
\begin{bmatrix}
a & b \\
c & d \\
\end{bmatrix}
$
$ A^{-1} = \frac{1}{|A|}
\begin{bmatrix}
d & -b \\
-c & a \\
\end{bmatrix}
$
$ A^{-1} = \frac{1}{ad-bc}
\begin{bmatrix}
d & -b \\
-c & a \\
\end{bmatrix}
$

$ (X^TX)^{-1} =
\begin{bmatrix}
\frac{x_1^2+x_2^2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{-(x_1+x_2)}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\
\frac{-(x_1+x_2)}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\
\end{bmatrix}
$
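Note that the common denominator (the determinant of $X^TX$) simplifies to a perfect square:
$$2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2) = x_1^2-2x_1x_2+x_2^2 = (x_1-x_2)^2,$$
so the inverse exists whenever $x_1 \neq x_2$.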


Compute
$X(X^TX)^{-1}$

$X(X^TX)^{-1}
=
$
$
\begin{bmatrix}
1 & x_1 \\
1 & x_2 \\
\end{bmatrix}
$
$
\begin{bmatrix}
\frac{x_1^2+x_2^2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{-(x_1+x_2)}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\
\frac{-(x_1+x_2)}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\
\end{bmatrix}
$
$=
\begin{bmatrix}
\frac{x^2_2-x_1x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{x_1-x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\
\frac{x_1^2-x_1x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{x_2-x_1}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\
\end{bmatrix}
$


Compute
$X(X^TX)^{-1}X^T$

$X(X^TX)^{-1}X^T
=
$
$
\begin{bmatrix}
\frac{x^2_2-x_1x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{x_1-x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\
\frac{x_1^2-x_1x_2}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} & \frac{x_2-x_1}{2x_1^2+2x_2^2-(x^2_1+2x_1x_2+x^2_2)} \\
\end{bmatrix}
$
$
\begin{bmatrix}
1 & 1 \\
x_1 & x_2 \\
\end{bmatrix}
$
$=
\begin{bmatrix}
1 & 0 \\
0 & 1 \\
\end{bmatrix}
$

which is the identity matrix (this is expected here: with $n=p=2$ and $x_1 \neq x_2$, $X$ is itself invertible, so $H = XX^{-1}(X^{T})^{-1}X^{T} = I$).
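A minimal numeric sketch (assuming numpy; the $x$ values below are arbitrary) confirms that $H$ is the identity in this $2\times 2$ case, and also shows that with more observations than parameters, $H$ is still symmetric and idempotent but no longer the identity:

```python
import numpy as np

# n = p = 2: the design matrix is square and invertible, so H is the identity.
x = np.array([1.0, 3.0])                      # arbitrary illustrative values
X = np.column_stack([np.ones_like(x), x])     # intercept column plus predictor
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(H, np.eye(2)))              # True

# n > p: H is a genuine projection -- symmetric and idempotent, but not I.
x = np.array([1.0, 2.0, 4.0, 7.0])
X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(H, H.T))                    # True: H is symmetric
print(np.allclose(H @ H, H))                  # True: H is idempotent
print(np.allclose(H, np.eye(4)))              # False: not the identity
```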

Best Answer

I'm going to assume this is all in the context of a linear model $Y = X\beta + \varepsilon$. Letting $H = X(X^T X)^{-1}X^T$, we have fitted values $\hat Y = H Y$ and residuals $e = Y - \hat Y = (I - H)Y$. For the second term in your expression, $$ \sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = \langle e, HY - \bar y \mathbb 1\rangle $$ (where $\mathbb 1$ is the vector of all $1$'s and $\langle \cdot, \cdot\rangle$ is the standard inner product) $$ = \langle (I-H)Y, HY - \bar y \mathbb 1\rangle = Y^T (I-H)HY - \bar y Y^T (I-H) \mathbb 1. $$ Assuming we have an intercept in our model, $\mathbb 1$ is in the span of the columns of $X$, so $(I-H)\mathbb 1 = 0$. We also know that $H$ is idempotent, so $(I-H)H = H-H^2 = H-H = 0$; therefore $\sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = 0$.

This tells us that the residuals are necessarily uncorrelated with the fitted values. This makes sense because the fitted values are the projection of $Y$ into the column space, while the residuals are the projection of $Y$ into the space orthogonal to the column space of $X$. These two vectors are necessarily orthogonal, i.e. uncorrelated.

By showing that, under this model, $\sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = 0$, we have proved that $$ \sum_i(y_i - \bar y)^2 = \sum_i(y_i - \hat y_i)^2 + \sum_i(\hat y_i - \bar y)^2 $$ which is a well-known decomposition.
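A minimal numeric sketch of this decomposition (assuming numpy; the simulated dataset and coefficients below are arbitrary):

```python
import numpy as np

# Simulate a small dataset and fit ordinary least squares with an intercept.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(size=50)       # arbitrary illustrative model

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e = y - y_hat

sst = np.sum((y - y.mean()) ** 2)             # total sum of squares
rss = np.sum(e ** 2)                          # residual sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)         # explained sum of squares
cross = np.sum(e * (y_hat - y.mean()))        # the middle term

print(np.isclose(cross, 0.0))                 # True: the cross term vanishes
print(np.isclose(sst, rss + ess))             # True: SST = RSS + ESS
```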

To answer your question about why correlation between $e$ and $\hat Y$ means there are better values possible, I think you really need to consider the geometric picture of linear regression as shown below, for example:

[Figure: geometric picture of linear regression, taken from random_guy's answer here.]

If we have two centered vectors $a$ and $b$, the (sample) correlation between them is $$ \operatorname{cor}(a, b) = \frac{\sum_i a_ib_i}{\sqrt{\sum_i a_i^2 \sum_i b_i^2}} = \cos \theta $$ where $\theta$ is the angle between them. If this is new to you, you can read more about it here.
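A small check of this identity (assuming numpy; the vectors below are arbitrary):

```python
import numpy as np

a = np.array([1.0, 4.0, 2.0, 8.0])            # arbitrary illustrative vectors
b = np.array([3.0, 1.0, 5.0, 7.0])
a_c, b_c = a - a.mean(), b - b.mean()         # center both vectors

cos_theta = a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c))
corr = np.corrcoef(a, b)[0, 1]                # sample correlation
print(np.isclose(cos_theta, corr))            # True
```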

Linear regression by definition seeks to minimize $\sum_i e_i^2$. Looking at the picture, we can see that this is the squared length of the vector $\hat \varepsilon$, and we know that this length will be the shortest when the angle between $\hat \varepsilon$ and $\hat Y$ is $90^\circ$ (if that's not clear, imagine moving the point given by the tip of the vector $\hat Y$ in the picture and see what happens to the length of $\hat \varepsilon$). Since $\cos 90^\circ = 0$, these two vectors are uncorrelated. If this angle is not $90^\circ$, i.e. $\sum_i e_i \hat y_i \neq 0 \implies \cos \theta \neq 0$, then we haven't found the $\hat Y$ that is as close to $Y$ as possible.

To answer your question about how the term $\sum_i (y_i - \hat y_i)(\hat y_i - \bar y)$ is a covariance, you need to remember that this is a sample covariance, not the covariance between random variables; and as I showed above, it is always $0$. Note that $$ \sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = \sum_i ([y_i - \hat y_i] - 0)([\hat y_i] - \bar y). $$ Since the sample average of the $y_i - \hat y_i$ is $0$ and the sample average of the $\hat y_i$ is $\bar y$, this is, up to the factor $1/N$, exactly a sample covariance by definition.
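Equivalently, in the notation of your derivation,
$$\sum_i (y_i - \hat y_i)(\hat y_i - \bar y) = N \cdot \widehat{\operatorname{cov}}\big(y_i - \hat y_i,\ \hat y_i\big),$$
so the extra $1/N$ you found is exactly the factor separating the sample covariance from the sum that appears in the decomposition, and since that covariance is $0$, so is the sum.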