In the classical linear model with
$$Y=X\beta +\epsilon,$$
where $Y \in \mathbb{R}^n$ is the vector of observations, $X\in \mathbb{R}^{n\times p}$ is the known covariate matrix, $\beta \in \mathbb{R}^p$ is the unknown parameter with $p < n$, and $\epsilon \in \mathbb{R}^{n}$ is the noise, $\epsilon \sim \mathcal{N}(0,\sigma^{2}I)$.
The classical least squares estimator here would be
$$\hat{\beta}= (X^TX)^{-1}X^TY.$$
If $\beta^{0}$ is the true parameter, then the (scaled) prediction error is $E=||X(\hat{\beta}- \beta^{0})||_2^2/\sigma^2$. Since $\hat{\beta}- \beta^{0} = (X^TX)^{-1}X^T\epsilon$, we have
$$E=||X(\hat{\beta}- \beta^{0})||_2^2/\sigma^2=||X(X^TX)^{-1}X^T \epsilon||_2^2/\sigma^2=||\gamma||_2^2/\sigma^2,$$
where $\gamma = X(X^TX)^{-1}X^T\epsilon$.
It is easy to see that $\gamma \sim \mathcal{N}(0,\sigma^2 X(X^TX)^{-1}X^T)$, since the hat matrix $X(X^TX)^{-1}X^T$ is symmetric and idempotent. However, it is claimed in Statistics for High-Dimensional Data
by Bühlmann and van de Geer
(on page 101) that $E$ is distributed according to a chi-square distribution with $p$ degrees of freedom. I cannot see how this is true. (It would be true if $\gamma \sim \mathcal{N}(0,D)$ for some diagonal matrix $D$ with non-negative diagonal entries.) Am I missing something here?
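To make the question concrete, here is a quick numerical sketch (not from the book; the dimensions and seed are arbitrary) confirming that the covariance of $\gamma$ is indeed $\sigma^2 X(X^TX)^{-1}X^T$, which is not diagonal in general:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 2.0
X = rng.standard_normal((n, p))

# Hat matrix H = X (X^T X)^{-1} X^T, so gamma = H @ eps.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Empirical covariance of gamma over many draws of eps ~ N(0, sigma^2 I).
eps = sigma * rng.standard_normal((100_000, n))
gamma = eps @ H.T                       # each row is H @ eps_i
emp_cov = gamma.T @ gamma / len(gamma)

print(np.allclose(emp_cov, sigma**2 * H, atol=0.1))   # covariance is sigma^2 H
print(np.allclose(H, np.diag(np.diag(H))))            # H is not diagonal
```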
Best Answer
Your last statement provides an important clue: not only would $D$ have to be diagonal, it would have to have $p$ ones on the diagonal and zeros elsewhere. So there must be something special about $X(X^\prime X)^{-1}X^\prime$. To see what, look at the Singular Value Decomposition of $X$,
$$X = U \Sigma V^\prime$$
where $U$ and $V$ are orthogonal (that is, $U^\prime U$ and $V^\prime V$ are identity matrices) and $\Sigma$ is diagonal. Use this to simplify
$$(X^\prime X)^{-1} = \left( \left( U \Sigma V^\prime \right)^\prime \left( U \Sigma V^\prime \right)\right)^{-1} = \left( V \Sigma^\prime U^\prime U \Sigma V^\prime \right)^{-1} = \left( V \Sigma^\prime \Sigma V^\prime \right)^{-1}$$
and employ that to compute
$$X(X^\prime X)^{-1}X^\prime =\left( U \Sigma V^\prime \right) \left( V \Sigma^\prime \Sigma V^\prime \right)^{-1}\left( U \Sigma V^\prime \right)^\prime = U\left(\Sigma\left(\Sigma^\prime \Sigma\right)^{-1}\Sigma^\prime\right) U^\prime.$$
This exhibits the covariance matrix of $\gamma$ as being conjugate (via the similarity induced by $U$) to $\Sigma\left(\Sigma^\prime \Sigma\right)^{-1}\Sigma^\prime$, which (since $\Sigma$ is diagonal) has $\text{rank}(\Sigma)$ ones along the diagonal and zeros elsewhere: in other words, the distribution of $\gamma$ is that of an orthogonal linear combination of $\text{rank}(\Sigma) = \text{rank}(X)$ independent, identically distributed Normal variates.
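This step can be checked numerically (a sketch with arbitrary dimensions; the thin SVD is used, so $\Sigma(\Sigma^\prime\Sigma)^{-1}\Sigma^\prime$ collapses to the identity and the hat matrix becomes $UU^\prime$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))

H = X @ np.linalg.solve(X.T @ X, X.T)

# Thin SVD: U is n x p with orthonormal columns, so H = U U^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(H, U @ U.T))

# Eigenvalues of H: rank(X) = p ones and n - p zeros, exactly the
# diagonal of Sigma (Sigma' Sigma)^{-1} Sigma' in the full SVD.
eig = np.sort(np.linalg.eigvalsh(H))[::-1]
print(np.allclose(eig[:p], 1.0), np.allclose(eig[p:], 0.0))
```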
Orthogonal transformations such as $U$ preserve sums of squares. Provided $X$ has full rank, namely $\min(p,n)=p$, the distribution of $E$ is therefore that of the sum of $p$ squares of independent standard Normal variables, which by definition is $\chi^2(p)$. More generally, $E \sim \chi^2(\text{rank}(X)).$
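A small Monte Carlo sketch of this conclusion (arbitrary dimensions and seed): the first two moments of $E$ should match those of $\chi^2(p)$, i.e. mean $p$ and variance $2p$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 50, 3, 2.0
X = rng.standard_normal((n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)

# Draw many eps ~ N(0, sigma^2 I) and compute E = ||H eps||^2 / sigma^2.
eps = sigma * rng.standard_normal((100_000, n))
E = np.sum((eps @ H.T) ** 2, axis=1) / sigma**2

# A chi^2(p) variable has mean p and variance 2p.
print(E.mean(), E.var())   # should be close to p = 3 and 2p = 6
```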
This algebraic argument is one way of finding out that ordinary least squares is just Euclidean geometry: this result is a rediscovery of the Pythagorean Theorem.