Solved – Prediction error in least squares with a linear model


Consider the classical linear model
$$Y=X\beta +\epsilon,$$
where $Y \in \mathbb{R}^n$ is the vector of observations, $X\in \mathbb{R}^{n\times p}$ is the known design matrix, $\beta \in \mathbb{R}^p$ is the unknown parameter with $p < n$, and $\epsilon \in \mathbb{R}^{n}$ is noise with $\epsilon \sim \mathcal{N}(0,\sigma^{2}I)$.

The classical least squares estimator here would be
$$\hat{\beta}= (X^TX)^{-1}X^TY.$$

If $\beta^{0}$ is the true parameter, the (noise-scaled) prediction error is $E=||X(\hat{\beta}- \beta^{0})||_2^2/\sigma^2$. Since $\hat{\beta}-\beta^{0}=(X^TX)^{-1}X^T\epsilon$, we have
$$E=||X(\hat{\beta}- \beta^{0})||_2^2/\sigma^2=||X(X^TX)^{-1}X^T \epsilon||_2^2/\sigma^2=||\gamma||_2^2/\sigma^2, \qquad \text{where } \gamma := X(X^TX)^{-1}X^T\epsilon.$$
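This identity is easy to check numerically. The following minimal numpy sketch (the sizes $n$, $p$, the noise level $\sigma$, and the seed are arbitrary illustration choices, not part of the original question) draws one realization of the model and confirms that $X(\hat{\beta}-\beta^{0})$ equals the projected noise $X(X^TX)^{-1}X^T\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 5, 2.0            # arbitrary illustration sizes

X = rng.standard_normal((n, p))     # full column rank with probability 1
beta0 = rng.standard_normal(p)      # the "true" parameter beta^0
eps = sigma * rng.standard_normal(n)
Y = X @ beta0 + eps

# Least squares estimate and the projection ("hat") matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
H = X @ np.linalg.solve(X.T @ X, X.T)

# X(beta_hat - beta0) is exactly the projected noise H @ eps,
# so E = ||H eps||^2 / sigma^2
print(np.allclose(X @ (beta_hat - beta0), H @ eps))   # True
E = np.sum((H @ eps) ** 2) / sigma**2
print(E)
```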

It is easy to see that $\gamma \sim \mathcal{N}(0,\sigma^2 X(X^TX)^{-1}X^T)$. However, it is claimed in Statistics for High-Dimensional Data by Bühlmann and van de Geer (on page 101) that $E$ is distributed according to a chi-square distribution with $p$ degrees of freedom. I cannot see how this is true (it would be true if $\gamma \sim \mathcal{N}(0,D)$ for some diagonal matrix $D$ with non-negative diagonal entries). Am I missing something here?

Best Answer

Your last statement provides an important clue: not only would $D$ have to be diagonal, it would have to have exactly $p$ diagonal entries equal to $\sigma^2$ and zeros elsewhere. So there must be something special about $X(X^\prime X)^{-1}X^\prime$. To see what, look at the Singular Value Decomposition (SVD) of $X$,

$$X = U \Sigma V^\prime$$

where $U$ ($n\times n$) and $V$ ($p\times p$) are orthogonal (that is, $U^\prime U$ and $V^\prime V$ are identity matrices) and $\Sigma$ is an $n\times p$ rectangular diagonal matrix of singular values. Use this to simplify

$$(X^\prime X)^{-1} = \left( \left( U \Sigma V^\prime \right)^\prime \left( U \Sigma V^\prime \right)\right)^{-1} = \left( V \Sigma^\prime U^\prime U \Sigma V^\prime \right)^{-1} = \left( V \Sigma^\prime \Sigma V^\prime \right)^{-1}$$

and employ that to compute

$$X(X^\prime X)^{-1}X^\prime =\left( U \Sigma V^\prime \right) \left( V \Sigma^\prime \Sigma V^\prime \right)^{-1}\left( U \Sigma V^\prime \right)^\prime = U\left(\Sigma\left(\Sigma^\prime \Sigma\right)^{-1}\Sigma^\prime\right) U^\prime.$$
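As a quick numerical sanity check of this identity (a sketch only; the dimensions and seed are arbitrary), one can build the rectangular diagonal $\Sigma$ from numpy's full SVD and confirm that $U\left(\Sigma\left(\Sigma^\prime \Sigma\right)^{-1}\Sigma^\prime\right)U^\prime$ reproduces $X(X^\prime X)^{-1}X^\prime$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3                          # small arbitrary sizes
X = rng.standard_normal((n, p))

# Full SVD: U is n x n, Vt is p x p, s holds the p singular values
U, s, Vt = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros((n, p))
Sigma[:p, :p] = np.diag(s)           # rectangular diagonal Sigma

hat_direct = X @ np.linalg.solve(X.T @ X, X.T)
inner = Sigma @ np.linalg.solve(Sigma.T @ Sigma, Sigma.T)   # Sigma (Sigma'Sigma)^{-1} Sigma'
hat_svd = U @ inner @ U.T

print(np.allclose(hat_direct, hat_svd))                     # True
```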

This exhibits the covariance matrix of $\gamma$ as $\sigma^2$ times a matrix conjugate (via the similarity induced by $U$) to $\Sigma\left(\Sigma^\prime \Sigma\right)^{-1}\Sigma^\prime$, which (since $\Sigma$ is diagonal) has $\text{rank}(\Sigma)$ ones along the diagonal and zeros elsewhere: in other words, the distribution of $\gamma/\sigma$ is that of an orthogonal linear combination of $\text{rank}(\Sigma) = \text{rank}(X)$ independent standard Normal variates.
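The rank statement can be checked the same way (again a sketch with arbitrary dimensions): the eigenvalues of $X(X^\prime X)^{-1}X^\prime$ come out as exactly $\text{rank}(X)=p$ ones and $n-p$ zeros, up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3
X = rng.standard_normal((n, p))

H = X @ np.linalg.solve(X.T @ X, X.T)       # X (X'X)^{-1} X'
eigvals = np.sort(np.linalg.eigvalsh(H))    # H is symmetric: real eigenvalues

# Expect n - p (near-)zero eigenvalues followed by p (near-)unit eigenvalues
print(np.round(eigvals, 10))
print(np.allclose(eigvals[:n - p], 0.0), np.allclose(eigvals[-p:], 1.0))
```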

Orthogonal transformations such as $U$ preserve sums of squares. Provided $X$ has full rank--which is $\min(p,n)=p$--the distribution of $E$ therefore is that of the sum of $p$ squares of independent standard Normal variables, which by definition is $\chi^2(p)$. More generally, $E \sim \chi^2(\text{rank}(X)).$
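The distributional conclusion is also easy to confirm by simulation. The sketch below (dimensions, noise level, number of replications, and seed are all arbitrary choices) draws many realizations of $E$ and compares the Monte Carlo mean and variance with the $\chi^2(p)$ values $p$ and $2p$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, reps = 50, 5, 2.0, 20_000      # arbitrary simulation settings

X = rng.standard_normal((n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)       # projection onto the column space of X

E = np.empty(reps)
for i in range(reps):
    eps = sigma * rng.standard_normal(n)
    E[i] = np.sum((H @ eps) ** 2) / sigma**2    # ||X(beta_hat - beta0)||^2 / sigma^2

# chi^2(p) has mean p and variance 2p
print(E.mean(), E.var())                    # close to 5 and 10
```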


This algebraic argument is one way of finding out that ordinary least squares is just Euclidean geometry: this result is a rediscovery of the Pythagorean Theorem.
