In my textbook, on the topic of linear regression/machine learning, there is a statement and a question, quoted below:
Consider a noisy target, $ y = (w^{*})^T \textbf{x} + \epsilon $, for generating the data, where $\epsilon$ is a noise term with zero mean and $\sigma^2$ variance, independently generated for every example $(\textbf{x},y)$. The expected error of the best possible linear fit to this target is thus $\sigma^2$.
For the data $D = \{ (\textbf{x}_1,y_1), \dots, (\textbf{x}_N,y_N) \}$, denote the noise in $y_n$ as $\epsilon_n$, and let $\boldsymbol{\epsilon} = [\epsilon_1, \epsilon_2, \dots, \epsilon_N]^T$; assume that $X^TX$ is invertible. By following the steps below, show that the expected in-sample error of linear regression with respect to $D$ is given by
$ \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 \left(1 - \frac{d+1}{N}\right) $
Below is my approach.
The book says that the in-sample error vector, $\hat{\textbf{y}} - \textbf{y}$, can be expressed as $(H-I)\epsilon$, where $H = X(X^TX)^{-1}X^T$ is the hat matrix and $\epsilon$ is the noise vector.
So I calculated the in-sample error, $E_{in}( \textbf{w}_{lin} )$, as
$E_{in}( \textbf{w}_{lin} ) = \frac{1}{N}(\hat{\textbf{y}} - \textbf{y})^T (\hat{\textbf{y}} - \textbf{y}) = \frac{1}{N} \epsilon^T (H-I)^T (H-I) \epsilon$
Since the book gives that
$(I-H)^k = (I-H)$ for any integer $k \ge 1$, that $(I-H)$ is symmetric, and that $\operatorname{trace}(H) = d+1$,
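(These properties are easy to check numerically; below is a quick sketch of mine, where the random design matrix and the dimensions $N = 50$, $d = 3$ are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3

# Design matrix with a leading column of ones (the bias term),
# so X has d + 1 columns in total.
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix

I = np.eye(N)
assert np.allclose(H, H.T)                    # H (hence I - H) is symmetric
assert np.allclose((I - H) @ (I - H), I - H)  # I - H is idempotent
assert np.isclose(np.trace(H), d + 1)         # trace(H) = d + 1
```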
I got the following simplified expression,
$E_{in}( \textbf{w}_{lin} ) = \frac{1}{N} \epsilon^T (H-I)^T (H-I) \epsilon = \frac{1}{N} \epsilon^T (I-H) \epsilon = \frac{1}{N} \epsilon^T \epsilon - \frac{1}{N} \epsilon^T H \epsilon$
Here, I see that
$\mathbb{E}_D[\frac{1}{N} \epsilon^T \epsilon] = \frac{N \sigma^2}{N} = \sigma^2$
Also, expanding the quadratic form $-\frac{1}{N} \epsilon^T H \epsilon$ gives the sum
$ -\frac{1}{N} \epsilon^T H \epsilon = -\frac{1}{N} \left\{ \sum_{i=1}^{N} H_{ii} \epsilon_i^2 + \sum_{\substack{i,j \in \{1,\dots,N\} \\ i \neq j}} H_{ij} \, \epsilon_i \, \epsilon_j \right\}$
I understand that
$ -\frac{1}{N} \mathbb{E}_D\left[\sum_{i=1}^{N} H_{ii} \epsilon_i^2\right] = -\frac{1}{N} \operatorname{trace}(H) \, \sigma^2 = -\frac{d+1}{N} \, \sigma^2$
However, I don't understand why
$ -\frac{1}{N} \mathbb{E}_D\left[\sum_{\substack{i,j \in \{1,\dots,N\} \\ i \neq j}} H_{ij} \, \epsilon_i \, \epsilon_j\right] = 0 \qquad (\text{eq. } 1)$
(eq. 1) must equal $0$ in order to satisfy the equation
$ \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 \left(1 - \frac{d+1}{N}\right)$
Can anyone explain why (eq. 1) evaluates to zero?
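For what it's worth, a quick Monte Carlo check of mine (assuming Gaussian noise and a random design; the dimensions are arbitrary) suggests the off-diagonal sum in (eq. 1) really does average out to zero, even though I don't see why:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 20, 2, 1.0
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix

# Average the off-diagonal part of eps^T H eps over many noise draws.
trials = 20_000
total = 0.0
for _ in range(trials):
    eps = rng.normal(0.0, sigma, N)
    # eps^T H eps minus its diagonal part = sum_{i != j} H_ij eps_i eps_j
    cross = eps @ H @ eps - np.sum(np.diag(H) * eps**2)
    total += cross

print(total / trials)  # hovers near 0
```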
Best Answer
Picking up from where you left off: let $E$ denote the $N \times 1$ in-sample error vector,
$$E = \hat y - y$$
$$\begin{align} E_{in}(w_{lin}) &= \frac{1}{N} E^T E \\ &= \frac{1}{N} (\epsilon^T (H-I)^T (H-I) \epsilon) \\ &= \frac{1}{N}(\epsilon^T (I-H) \epsilon) \end{align}$$
then
$$\begin{align} \Bbb E_D \left[ E_{in}(w_{lin})\right] &= \frac{1}{N} \Bbb E_D[\epsilon^T (I-H) \epsilon] \\ &= \frac{1}{N} (\Bbb E_D[\epsilon^T \epsilon] - \Bbb E_D[\epsilon^T H \epsilon]) \\ &= \frac{1}{N} \left(N\sigma^2 - (d+1)\sigma^2\right) \\ &= \sigma^2 \left(1 - \frac{d+1}{N}\right) \end{align}$$
This is because, for $i \neq j$, $\epsilon_i$ and $\epsilon_j$ are independent with zero mean, so $\Bbb E_D[\epsilon_i \epsilon_j] = \Bbb E_D[\epsilon_i] \, \Bbb E_D[\epsilon_j] = 0$, while $\Bbb E_D[\epsilon_i^2] = \sigma^2$. Hence
$$\Bbb E_D[\epsilon^T H \epsilon] = \sum_{i=1}^{N} H_{ii} \, \sigma^2 = \operatorname{trace}(H) \, \sigma^2 = (d+1) \, \sigma^2,$$
and in particular every off-diagonal term in your (eq. 1) has expectation zero, which is exactly why that sum vanishes.
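As a sanity check (not part of the proof), the final identity can also be verified by simulation. A minimal sketch, assuming Gaussian noise, a random design, and an arbitrary target $w^*$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 30, 4, 0.5
w_star = rng.standard_normal(d + 1)            # arbitrary target weights
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])

trials = 20_000
total = 0.0
for _ in range(trials):
    eps = rng.normal(0.0, sigma, N)            # fresh noise each draw
    y = X @ w_star + eps
    w_lin = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares fit
    resid = X @ w_lin - y
    total += resid @ resid / N                 # E_in for this draw

print(total / trials)                # empirical  E_D[E_in]
print(sigma**2 * (1 - (d + 1) / N))  # theoretical sigma^2 (1 - (d+1)/N)
```

The two printed values should agree to a couple of decimal places for these sizes.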