[Math] Expected in-sample error of linear regression with respect to a dataset D

machine-learning, probability, regression, statistics

My textbook makes the following statement on the topic of linear regression, followed by an exercise, quoted here:

Consider a noisy target, $ y = (w^{*})^T \textbf{x} + \epsilon $, for generating the data, where $\epsilon$ is a noise term with zero mean and $\sigma^2$ variance, independently generated for every example $(\textbf{x},y)$. The expected error of the best possible linear fit to this target is thus $\sigma^2$.

For the data $D = \{ (\textbf{x}_1,y_1), …, (\textbf{x}_N,y_N) \}$, denote the noise in $y_n$ as $\epsilon_n$, and let $ \mathbf{\epsilon} = [\epsilon_1, \epsilon_2, …\epsilon_N]^T$; assume that $X^TX$ is invertible. By following the steps below, show that the expected in-sample error of linear regression with respect to $D$ is given by,

$ \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 \left(1 - \frac{d+1}{N}\right)$

Here is my approach.

The book says that the in-sample error vector, $\hat{\textbf{y}} - \textbf{y}$, can be expressed as $(H-I)\epsilon$: the hat matrix, $H = X(X^TX)^{-1}X^T$, minus the identity, times the noise vector $\epsilon$.

So I calculated the in-sample error, $E_{in}( \textbf{w}_{lin} )$, as

$E_{in}( \textbf{w}_{lin} ) = \frac{1}{N}(\hat{\textbf{y}} - \textbf{y})^T (\hat{\textbf{y}} - \textbf{y}) = \frac{1}{N} \epsilon^T (H-I)^T (H-I) \epsilon$

Since the book states that $(I-H)$ is symmetric and idempotent, i.e.

$(I-H)^K = (I-H)$ for any integer $K \geq 1$, and that $\operatorname{trace}(H) = d+1$,
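These facts about the hat matrix are easy to confirm numerically. Below is a minimal sketch (my own, not from the book), using an arbitrary random design matrix with a bias column:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3

# Design matrix with a bias column, so X is N x (d+1)
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(N)

# (I - H) is symmetric and idempotent: (I - H)^2 = I - H
assert np.allclose((I - H) @ (I - H), I - H)
assert np.allclose(H, H.T)

# trace(H) equals the number of fitted parameters, d + 1
print(np.trace(H))  # ≈ 4.0
```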

I got the following simplified expression,

$E_{in}( \textbf{w}_{lin} ) = \frac{1}{N} \epsilon^T (H-I)^T (H-I) \epsilon = \frac{1}{N} \epsilon^T (I-H) \epsilon = \frac{1}{N} \epsilon^T \epsilon - \frac{1}{N} \epsilon^T H \epsilon$

Here, I see that,

$\mathbb{E}_D[\frac{1}{N} \epsilon^T \epsilon] = \frac {N \sigma^2}{N} = \sigma^2$

Also, the quadratic form $ - \frac{1}{N} \epsilon^T H \epsilon$ expands into the following sum,

$ - \frac{1}{N} \epsilon^T H \epsilon = - \frac{1}{N} \left( \sum_{i=1}^{N} H_{ii} \epsilon_i^2 + \sum_{\substack{i,j = 1 \\ i \neq j}}^{N} H_{ij} \, \epsilon_i \, \epsilon_j \right)$

I understand that

$ - \frac{1}{N} \mathbb{E}_D\left[\sum_{i=1}^{N} H_{ii} \epsilon_i^2\right] = - \frac{1}{N} \operatorname{trace}(H) \, \sigma^2 = - \frac{(d+1) \, \sigma^2}{N}$

However, I don't understand why,

$ - \frac{1}{N} \mathbb{E}_D\left[\sum_{\substack{i,j = 1 \\ i \neq j}}^{N} H_{ij} \, \epsilon_i \, \epsilon_j \right] = 0$ $\ \ \ \ \ \ \ \ \ \ \ \ (eq \ 1)$

$(eq \ 1)$ must equal $0$ in order to obtain

$ \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 \left(1 - \frac{d+1}{N}\right)$

Can anyone explain why $(eq \ 1)$ equals zero?
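As a sanity check, here is a small simulation (my own sketch, not from the book) that estimates the average of the cross term in $(eq \ 1)$ over many independent noise draws:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 20, 2, 0.5
trials = 20000

# Arbitrary fixed design with a bias column, and its hat matrix
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Keep only the off-diagonal entries of H, so eps @ H_off @ eps
# is exactly the cross term: sum over i != j of H_ij eps_i eps_j
H_off = H - np.diag(np.diag(H))

vals = []
for _ in range(trials):
    eps = rng.normal(0.0, sigma, size=N)
    vals.append(eps @ H_off @ eps)

print(np.mean(vals))  # ≈ 0
```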

Best Answer

I start from the same point you reached above.

Let $E$ denote the $N \times 1$ in-sample error vector:

$$E = \hat y - y$$

$$\begin{align} E_{in}(w_{lin}) &= \frac{1}{N} E^T E \\ &= \frac{1}{N} (\epsilon^T (H-I)^T (H-I) \epsilon) \\ &= \frac{1}{N}(\epsilon^T (I-H) \epsilon) \end{align}$$

then

$$\begin{align} \Bbb E_D \left[ E_{in}(w_{lin})\right] &= \frac{1}{N} \Bbb E_D[\epsilon^T (I-H) \epsilon] \\ &= \frac{1}{N} (\Bbb E_D[\epsilon^T \epsilon] - \Bbb E_D[\epsilon^T H \epsilon]) \\ &= \sigma^2 (1 - \frac{d+1}{N}) \end{align}$$

Because:

(1) $$\begin{align} \Bbb E_D[\epsilon^T \epsilon] &= \Bbb E_D[\epsilon_1^2 + \epsilon_2^2 + \cdots + \epsilon_N^2] \\ &= \Bbb E_D[\epsilon_1^2] + \cdots + \Bbb E_D[\epsilon_N^2] \\ &= N \sigma^2 \end{align}$$

(2) $$\begin{align} \Bbb E_D[\epsilon^T H \epsilon] & = \Bbb E_D \left[ \sum_{i=1}^{N} H_{ii} \epsilon_i^2 + \sum_{\substack{i,j=1 \\ i \neq j}}^{N} H_{ij} \epsilon_i \epsilon_j \right] \\ &= \Bbb E_D\left[\sum_{i=1}^{N} H_{ii} \epsilon_i^2\right] + \Bbb E_D\left[\sum_{\substack{i,j=1 \\ i \neq j}}^{N} H_{ij} \epsilon_i \epsilon_j\right] \\ &= \operatorname{trace}(H) \sigma^2 + 0 \\ &= \sigma^2 (d+1) \end{align}$$


The last step answers your question: for $i \neq j$, the noise terms $\epsilon_i$ and $\epsilon_j$ are independent with $\Bbb E_D[\epsilon_i] = 0$ and $\Bbb E_D[\epsilon_i^2] = \sigma^2$, so $\Bbb E_D[\epsilon_i \epsilon_j] = \Bbb E_D[\epsilon_i] \, \Bbb E_D[\epsilon_j] = 0$, and every cross term vanishes in expectation.
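The whole identity can also be checked by simulation. The sketch below (my own, with an arbitrary random design held fixed, since the identity holds conditionally on $X$ and hence also on average over $D$) compares the Monte Carlo average of $E_{in}$ against $\sigma^2 (1 - \frac{d+1}{N})$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 30, 4, 1.0
trials = 20000

# Fixed design matrix with a bias column, and a fixed target w*
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_star = rng.standard_normal(d + 1)

errors = []
for _ in range(trials):
    eps = rng.normal(0.0, sigma, size=N)     # fresh noise each trial
    y = X @ w_star + eps                     # noisy target
    w_lin = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares fit
    resid = X @ w_lin - y
    errors.append(resid @ resid / N)         # in-sample error E_in

print(np.mean(errors))                 # Monte Carlo estimate
print(sigma**2 * (1 - (d + 1) / N))    # theory: ≈ 0.8333
```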