[Math] Covariance of y-hat and y, $\mathrm{cov}(\hat{Y},Y)$ with unrestricted (non-spherical) error variance

linear-algebra, projection-matrices, regression

Consider the standard regression equation
$$
Y = X\beta + \varepsilon
$$
with predicted value $\hat{Y} = P_XY$, where $P_X = X(X'X)^{-1}X'$ is the projection onto the column space of $X$, i.e. the "hat" matrix. Also let $\mathrm{Var}(\varepsilon) = \Sigma$ be an arbitrary symmetric, positive semi-definite matrix.
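
For reference, here is a minimal numpy sketch of this setup; the variable names, the random design, and the particular PSD $\Sigma$ are just illustrative choices, not part of the question.

import numpy as np

rng = np.random.default_rng(0)

# Setup: random design with an intercept column
N, k = 100, 2
X = rng.normal(size=(N, k))
X[:, 0] = 1

# Hat matrix P_X = X (X'X)^{-1} X'
P = X @ np.linalg.inv(X.T @ X) @ X.T

# An arbitrary symmetric positive semi-definite error covariance
A = rng.normal(size=(N, N))
Sigma = A @ A.T

# P_X is a projection: symmetric and idempotent
print(np.allclose(P, P.T), np.allclose(P @ P, P))   # True True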

The question: What is $\mathrm{cov}(\hat{Y}, Y)$?

This question is answered in a linked post for the case where $\Sigma = \sigma^2I$, but with general $\Sigma$ the two derivations given there do not lead to the same answer.

Derivation 1

$$
\mathrm{cov}(\hat{Y}, Y) = \mathrm{cov}(P_XY, Y) = P_X \mathrm{cov}(Y, Y) = P_X\mathrm{Var}(Y) = P_X\Sigma
$$
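
Derivation 1 can be checked by simulation: hold $X$ fixed, draw many replications of $Y = X\beta + \varepsilon$ with a non-spherical $\Sigma$, and compare the empirical cross-covariance of $\hat{Y}$ and $Y$ with $P_X\Sigma$. A sketch, where the heteroskedastic diagonal $\Sigma$, the seed, and the particular $\beta$ are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

# Fixed design, many replications of Y = X beta + eps
N, k, R = 20, 2, 200_000
X = rng.normal(size=(N, k)); X[:, 0] = 1
beta = np.array([1.0, 2.0])
P = X @ np.linalg.inv(X.T @ X) @ X.T
Sigma = np.diag(np.linspace(0.5, 3.0, N))      # non-spherical (heteroskedastic) Sigma

# eps ~ N(0, Sigma); Sigma is diagonal, so scale independent normals column-wise
eps = rng.normal(size=(R, N)) * np.sqrt(np.diag(Sigma))
Y = X @ beta + eps                             # each row is one realization of Y
Yhat = Y @ P                                   # P is symmetric, so Y @ P applies P row-wise

# Empirical cross-covariance of Y-hat and Y across replications
cov_emp = (Yhat - Yhat.mean(0)).T @ (Y - Y.mean(0)) / R

print(np.abs(cov_emp - P @ Sigma).max())       # small: only Monte Carlo error
print(np.abs(cov_emp - P @ Sigma @ P).max())   # clearly larger: does not match P Sigma P

(Here the covariance is taken across replications of $Y$ with $X$ held fixed, i.e. the population quantity the question asks about, rather than the in-sample covariance of a single realization.)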

Derivation 2

Note $\hat{\varepsilon} = Y - \hat{Y}$ and $\mathrm{cov}(\hat{Y}, \hat{\varepsilon}) = 0$ (Edit: This is not true for $\Sigma\neq\sigma^2I$). Then

$$
\mathrm{cov}(\hat{Y}, Y) = \mathrm{cov}(\hat{Y}, \hat{Y} + \hat{\varepsilon}) = \mathrm{Var}(\hat{Y}) = \mathrm{Var}(P_XY) = P_X\mathrm{Var}(Y)P_X = P_X\Sigma P_X \neq P_X\Sigma
$$
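
The discrepancy itself is plain matrix arithmetic: for a non-spherical $\Sigma$, $P_X\Sigma P_X$ and $P_X\Sigma$ genuinely differ, while they coincide when $\Sigma = \sigma^2I$. A quick numerical sketch, with an arbitrary heteroskedastic $\Sigma$ as the illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
N, k = 20, 2
X = rng.normal(size=(N, k)); X[:, 0] = 1
P = X @ np.linalg.inv(X.T @ X) @ X.T

Sigma = np.diag(np.linspace(0.5, 3.0, N))             # non-spherical, purely illustrative
print(np.allclose(P @ Sigma @ P, P @ Sigma))          # False: the two expressions differ

Sigma_sph = 2.0 * np.eye(N)                           # spherical case, Sigma = sigma^2 I
print(np.allclose(P @ Sigma_sph @ P, P @ Sigma_sph))  # True: P sigma^2 I P = sigma^2 P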

Derivation 2, version 2.

(Not included in the linked post.)

Recall $\hat{Y} = X\hat{\beta}$ where $\hat{\beta} = (X'X)^{-1}X'Y$ and $\mathrm{Var}(\hat{\beta}) = (X'X)^{-1}X'\Sigma X(X'X)^{-1}$. Then

$$
\mathrm{cov}(\hat{Y}, Y) = \mathrm{cov}(\hat{Y}, \hat{Y} + \hat{\varepsilon}) = \mathrm{Var}(\hat{Y}) = \mathrm{Var}(X\hat{\beta}) = X\mathrm{Var}(\hat{\beta})X' = P_X\Sigma P_X
$$
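
This version can also be checked directly, since the identity $X\,\mathrm{Var}(\hat{\beta})\,X' = P_X\Sigma P_X$ is pure matrix algebra. A short sketch, again with an arbitrary illustrative $\Sigma$:

import numpy as np

rng = np.random.default_rng(0)
N, k = 20, 2
X = rng.normal(size=(N, k)); X[:, 0] = 1
XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T
Sigma = np.diag(np.linspace(0.5, 3.0, N))    # any symmetric PSD Sigma works; this one is illustrative

# Sandwich form of Var(beta-hat) and the identity X Var(beta-hat) X' = P Sigma P
var_beta = XtX_inv @ X.T @ Sigma @ X @ XtX_inv
print(np.allclose(X @ var_beta @ X.T, P @ Sigma @ P))   # True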

Note that when spherical errors are assumed, so $\Sigma = \sigma^2I$, all of the above collapse to $\sigma^2P_X$, which is what the previous post found.

Where is the error?

I believe Derivation 2 is probably the correct one, but Derivation 1 seems so straightforward that I'm not sure what the problem is.

As a bonus, the following Python code numerically shows a few of the above identities, in particular $\mathrm{cov}(\hat{Y}, Y) = \mathrm{Var}(\hat{Y})$, though for the case of spherical errors, obviously.

import numpy as np
import scipy.linalg as la

# Setup
N = 100
y = np.random.normal(size=N)
X = np.random.normal(size=(N, 2))
X[:, 0] = 1
# Regression
P = X.dot(la.inv(X.T.dot(X))).dot(X.T)
yhat = P.dot(y)
ehat = y - yhat
# Covariances
cov1 = np.cov(yhat, y)
cov2 = np.cov(yhat, ehat)
print(cov1[0, 0] - cov1[0, 1])   # var(yhat) - cov(yhat, y) = 0 (up to floating point)
print(cov2[0, 1])                # cov(yhat, ehat) = 0 (up to floating point)

Edit:

Solution (Correction to Derivation 2)

Many thanks to Arin Chaudhuri for pointing out my mistake. With general $\Sigma$, we have $\mathrm{cov}(\hat{Y}, \hat{\varepsilon}) = P_X\Sigma (I - P_X)$. If $\Sigma = \sigma^2I$, this reduces to $0$, but in general it is non-zero. Adding that term back into Derivation 2 yields the correct answer:
$$
\mathrm{cov}(\hat{Y}, Y) = \mathrm{Var}(\hat{Y}) + \mathrm{cov}(\hat{Y}, \hat{\varepsilon}) = P_X\Sigma P_X + P_X\Sigma(I - P_X) = P_X\Sigma
$$
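
The corrected decomposition is again just matrix algebra, so it can be verified exactly (up to floating point) without any simulation. A sketch with an arbitrary non-spherical $\Sigma$ as the illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
N, k = 20, 2
X = rng.normal(size=(N, k)); X[:, 0] = 1
P = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(N)
Sigma = np.diag(np.linspace(0.5, 3.0, N))    # illustrative non-spherical Sigma

var_yhat   = P @ Sigma @ P                   # Var(Y-hat)
cov_yhat_e = P @ Sigma @ (I - P)             # Cov(Y-hat, e-hat)

print(np.allclose(cov_yhat_e, 0))                      # False: non-zero for this Sigma
print(np.allclose(var_yhat + cov_yhat_e, P @ Sigma))   # True: the two pieces sum to P Sigma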

Best Answer

Your assumption $\text{Cov}(\hat{Y},\hat{\epsilon}) = 0$ may not be correct. My reasoning follows:

The cross-covariance between any two random vectors $X_1$ and $X_2$ is $\text{E}\big[(X_1 - \text{E}X_1)(X_2 - \text{E}X_2)^T\big].$

So $\text{Cov}(\hat{Y},\hat{\epsilon}) = \text{E}\big[(\hat{Y} - \text{E}\hat{Y})(\hat{\epsilon}-\text{E}\hat{\epsilon})^T\big].$

Now $\hat{Y} = P_X Y = P_X(X\beta + \epsilon) = X\beta + P_X\epsilon,$ since $P_X X = X.$

So $\text{E}\hat{Y} = X\beta$ and $\hat{Y} - \text{E}\hat{Y} = P_X\epsilon.$

And $\hat{\epsilon} = Y - \hat{Y} = (I-P_X)Y,$ so $\text{E}\hat{\epsilon} = 0$ and $\hat{\epsilon} - \text{E}\hat{\epsilon} = (I-P_X)Y = (I-P_X)(X\beta + \epsilon) = (I-P_X)\epsilon.$

So, using the symmetry of $I-P_X$, $\text{Cov}(\hat{Y},\hat{\epsilon}) = \text{E}\big[(P_X\epsilon)\big((I-P_X)\epsilon\big)^T\big] = \text{E}\big(P_X \epsilon \epsilon^T (I-P_X)\big) = P_X \text{E}(\epsilon \epsilon^T) (I-P_X) = P_X \Sigma(I-P_X) = P_X\Sigma - P_X\Sigma P_X.$
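
This formula is easy to confirm by simulation: with $X$ held fixed and repeated draws of $\epsilon \sim N(0, \Sigma)$, the empirical cross-covariance of $\hat{Y}$ and $\hat{\epsilon}$ across replications matches $P_X\Sigma(I-P_X)$ and is clearly non-zero. A sketch, where the heteroskedastic $\Sigma$, the seed, and $\beta$ are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
N, k, R = 20, 2, 200_000
X = rng.normal(size=(N, k)); X[:, 0] = 1
beta = np.array([1.0, 2.0])
P = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(N)
Sigma = np.diag(np.linspace(0.5, 3.0, N))        # illustrative heteroskedastic Sigma

# Replications of Y = X beta + eps with eps ~ N(0, Sigma), X held fixed
eps = rng.normal(size=(R, N)) * np.sqrt(np.diag(Sigma))
Y = X @ beta + eps
Yhat = Y @ P                                      # P is symmetric
Ehat = Y - Yhat

# Cross-covariance of Y-hat and e-hat across replications (not the in-sample covariance)
cov_yhat_ehat = (Yhat - Yhat.mean(0)).T @ (Ehat - Ehat.mean(0)) / R
target = P @ Sigma @ (I - P)

print(np.abs(cov_yhat_ehat - target).max())       # small: Monte Carlo error only
print(np.abs(target).max())                       # the cross-covariance itself is clearly non-zero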