Proof of $\frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\frac{d}{n}$

linear regression, machine learning, proof-explanation, pseudoinverse, statistics

I am trying to find a proof for the MSE of a linear regression:

\begin{gather}
\frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\frac{d}{n}
\end{gather}

The variables are defined as follows:

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: feature matrix with full column rank, containing $n$ feature vectors in $d$ dimensions

  • $\mathbf{z} \in \mathbb{R}^{n}$: Gaussian noise vector distributed as $\mathcal{N}(0, \mathrm{diag}(\sigma^2, \dots, \sigma^2))$

  • $\mathbf{y} \in \mathbb{R}^{n}$: noisy measurement of a true signal $\mathbf{y}^{*}$ with additive Gaussian noise, $\mathbf{y} = \mathbf{y}^{*} + \mathbf{z}$

The computed optimal weights are obtained by applying the pseudo-inverse of $\mathbf{X}$ to the above $\mathbf{y}$:

\begin{gather}
\mathbf{\hat{w}} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{y}^{*} + \mathbf{z})
\end{gather}

With the true optimal weights denoted by

$$\mathbf{w}^{*} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}^{*}$$

The expectation is taken over the noise-vector $\mathbf{z}$ with all other variables assumed to be determined/non-probabilistic.
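
As a quick sanity check, a Monte Carlo estimate agrees with the claimed value. This is only a NumPy sketch with arbitrary illustrative values ($n=50$, $d=5$, $\sigma=0.3$), not part of the actual problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 5, 0.3               # illustrative values only

X = rng.standard_normal((n, d))        # full column rank almost surely
y_star = rng.standard_normal(n)        # fixed "true" signal
Xp = np.linalg.pinv(X)                 # pseudo-inverse of X
w_star = Xp @ y_star                   # true optimal weights

# Monte Carlo estimate of (1/n) E[ ||X w_hat - X w_star||_2^2 ] over the noise z
trials = 20_000
mse = 0.0
for _ in range(trials):
    z = sigma * rng.standard_normal(n)
    w_hat = Xp @ (y_star + z)
    mse += np.sum((X @ w_hat - X @ w_star) ** 2) / n
mse /= trials

print(f"Monte Carlo estimate: {mse:.5f}")
print(f"sigma^2 * d / n     : {sigma**2 * d / n:.5f}")
```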

So far I have tried all kinds of shenanigans with the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\intercal$ but could only come to the result of

$$\frac{1}{n}\mathrm{diag}(\sigma^2, \dots, \sigma^2)$$

My main problem is figuring out how $d$ gets into the equation.

Best Answer

First, note that $\hat{w} = (X^\intercal X)^{-1}X^\intercal y$ (and likewise $w^* = (X^\intercal X)^{-1}X^\intercal y^*$) is only valid when $X^\intercal X$ is invertible, which need not be the case. In general, both are expressed in terms of the pseudoinverse: $\hat{w} = X^{+}y$ and $w^* = X^{+}y^*$.
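
To illustrate this remark, here is a small NumPy sketch with an arbitrary matrix (the sizes are just for this example): for full column rank the two formulas coincide, but only the pseudoinverse remains defined when $X^\intercal X$ is singular.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))        # n > d, full column rank almost surely

# For full column rank, pinv(X) equals the normal-equations formula
print(np.allclose(np.linalg.pinv(X), np.linalg.inv(X.T @ X) @ X.T))  # True

# Duplicating a column makes X^T X singular; pinv is still well defined
X_deficient = np.hstack([X, X[:, :1]])
print(np.linalg.matrix_rank(X_deficient.T @ X_deficient))  # 5, not 6
print(np.linalg.pinv(X_deficient).shape)                   # (6, 50)
```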

Since $\hat{w} = X^{+}(y^{*} + z)$ and $w^{*} = X^{+}y^{*}$, we have $X\hat{w} - Xw^{*} = XX^{+}z$. Moreover, $XX^{+}$ is symmetric and idempotent (an orthogonal projection), so $(XX^{+})^\intercal XX^{+} = XX^{+}$. The rest is the famous trace trick:

$$\begin{aligned}
\tfrac{1}{n}\mathrm{E}\big[\|X\hat{w} - Xw^{*}\|_2^2\big]
&= \tfrac{1}{n}\mathrm{E}\big[\|XX^{+}z\|_2^2\big] \\
&= \tfrac{1}{n}\mathrm{E}\big[z^\intercal(XX^{+})^\intercal XX^{+}z\big] \\
&= \tfrac{1}{n}\mathrm{E}\big[z^\intercal XX^{+}z\big] \\
&= \tfrac{1}{n}\mathrm{E}\big[\mathrm{tr}(z^\intercal XX^{+}z)\big] \\
&= \tfrac{1}{n}\mathrm{E}\big[\mathrm{tr}(XX^{+}zz^\intercal)\big] \\
&= \tfrac{1}{n}\mathrm{tr}\big(\mathrm{E}[XX^{+}zz^\intercal]\big) \\
&= \tfrac{1}{n}\mathrm{tr}\big(XX^{+}\,\mathrm{E}[zz^\intercal]\big) \\
&= \tfrac{1}{n}\mathrm{tr}\big(XX^{+}\cdot\sigma^2\mathbf{I}_n\big) \\
&= \tfrac{1}{n}\sigma^2\,\mathrm{tr}(XX^{+}) \\
&= \tfrac{1}{n}\sigma^2\,\mathrm{rank}(X)
\end{aligned}$$
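
For the last step, note that $XX^{+}$ is the orthogonal projection onto the column space of $X$, so its trace equals its rank. One way to see this explicitly (tying back to the SVD attempt in the question) is

$$XX^{+} = U\Sigma V^\intercal\, V\Sigma^{+}U^\intercal = U\Sigma\Sigma^{+}U^\intercal = U_r U_r^\intercal,
\qquad
\mathrm{tr}(XX^{+}) = \mathrm{tr}(U_r^\intercal U_r) = \mathrm{tr}(\mathbf{I}_r) = r = \mathrm{rank}(X),$$

where $U_r$ denotes the left singular vectors belonging to the $r$ nonzero singular values.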

In particular, if $n \ge d$ and $X$ has full column rank, then $\mathrm{rank}(X) = d$ and

$$\tfrac{1}{n}\mathrm{E}\big[\|X\hat{w} - Xw^{*}\|_2^2\big] = \frac{d}{n}\sigma^2$$
