Machine Learning – Expected Squared Prediction Error Conditioned on Training Set

machine-learning, probability, statistics

I'm reading The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, and I am thoroughly confused by the way they condition the expected squared prediction error in Section 2.5 (p. 26):
\begin{align*}
EPE(x_0) &= E_{y_0|x_0} E_{\mathcal{T}} (y_0 - \hat{y}_0)^2
\end{align*}

I think $\mathcal{T}$ refers to the training set and $(x_0, y_0)$ is a test point. With respect to what joint distribution is $EPE(x_0)$ evaluated? I can't make sense of what a distribution like $f(y_0|x_0)\,\pi(\mathcal{T})$ would even mean. I've seen many questions asked about their earlier definition of the $EPE$ (p. 18):
\begin{align*}
EPE(f) &= E_X E_{Y|X} ([Y - f(X)]^2|X)
\end{align*}
Here the conditioning makes sense: I can see that the $EPE$ is taken with respect to the joint distribution of $X$ and $Y$, where $X$ is the input vector and $Y$ is the output. Could someone please explain why the $EPE(x_0)$ written above makes sense?
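
If I interpret the double expectation as averaging over both the training set and $y_0\mid x_0$, I can at least simulate it. Below is a small sketch, my own illustration assuming a linear model $y = x^T\beta + \varepsilon$ with Gaussian errors; the choices of $\beta$, $\sigma$, $x_0$, $n$ and $p$ are arbitrary and not from the book. Each iteration draws a fresh training set, fits least squares, predicts at the fixed $x_0$, and draws a new $y_0$ from $y_0\mid x_0$. But I'm not sure this reading is what the notation intends.

```python
import numpy as np

# Monte Carlo reading of EPE(x0) = E_{y0|x0} E_T (y0 - yhat0)^2, assuming the
# (illustrative) linear model y = x^T beta + eps, eps ~ N(0, sigma^2).
rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0
beta = np.array([1.0, -2.0, 0.5])          # arbitrary "true" coefficients
x0 = np.array([0.3, 1.0, -0.7])            # fixed test input

n_sims = 20_000
sq_errors = np.empty(n_sims)
for s in range(n_sims):
    # Draw a fresh training set T = (X, y): both inputs and errors vary.
    X = rng.normal(size=(n, p))
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares fit on T
    y0_hat = x0 @ beta_hat                            # prediction at x0 from this T
    y0 = x0 @ beta + sigma * rng.normal()             # draw y0 from y0 | x0
    sq_errors[s] = (y0 - y0_hat) ** 2

print("Monte Carlo EPE(x0):", sq_errors.mean())
# For standard-normal training inputs, E[(X^T X)^{-1}] = I / (n - p - 1), so under
# this simulation design EPE(x0) = sigma^2 * (1 + x0.x0 / (n - p - 1)).
print("Closed form        :", sigma**2 * (1 + x0 @ x0 / (n - p - 1)))
```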

Best Answer

The notation in that discussion is thoroughly confusing. In the derivation of $EPE(x_0)$ it's unclear what $E_{y_0|x_0}E_{\cal T}$ is supposed to mean: I don't read it as an iterated-expectation calculation like $E(Z)=E_W[E(Z\mid W)]$, obtained by first conditioning on $W$, unless there is some inconsistency in the notation somewhere.

IMHO a less confusing development would be to write $$ PE:=(y_0-\hat y_0)^2 = [(y_0-x_0^T\beta) + (x_0^T\beta - \hat y_0)]^2=:(A-B)^2,\tag1 $$ say, and then take expectations conditional on the test point $x_0$ and the training inputs $X$ (the design matrix), under the linear model $y=x^T\beta+\varepsilon$ with i.i.d. mean-zero errors of variance $\sigma^2$. Conditionally on $x_0$ and $X$, the terms $A$ and $B$ are independent (the first depends only on the error $\varepsilon$ of the new observation $y_0$, the second only on the errors in the training responses) and $A$ has expectation zero (since $\varepsilon$ has mean zero). So the cross term $AB$ has expectation zero and the conditional EPE is $$ E(PE\mid x_0,X) = E(A^2\mid x_0,X) + E(B^2\mid x_0,X).\tag2 $$

The first term on the RHS of (2) is $\sigma^2$, and the second is $V(\hat y_0\mid x_0,X)$, since $\hat y_0$ is (conditionally) an unbiased estimator of $x_0^T\beta$. So (2) becomes $$ E(PE\mid x_0,X)=\sigma^2+x_0^T(X^TX)^{-1}x_0\,\sigma^2.\tag3 $$

Finally, take the expectation of (3) over the training inputs, holding $x_0$ fixed, to get $$ E(PE\mid x_0)=\sigma^2+x_0^TE_{\cal T}[(X^TX)^{-1}]x_0\,\sigma^2, $$ which is the final formula in (2.27).
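
For concreteness, here is a small Monte Carlo sketch of (3), purely as an illustration under the assumed linear model; the particular $\beta$, $\sigma$, $x_0$ and dimensions are arbitrary choices, not from the book. It holds one design matrix $X$ fixed, redraws the training responses many times, refits least squares, draws a fresh $y_0$ at $x_0$ each time, and compares the average of $(y_0-\hat y_0)^2$ with $\sigma^2\big(1+x_0^T(X^TX)^{-1}x_0\big)$.

```python
import numpy as np

# Sketch checking (3): with the training inputs X held fixed,
#   E[(y0 - yhat0)^2 | x0, X] = sigma^2 * (1 + x0^T (X^T X)^{-1} x0).
# The linear model and all numbers below are illustrative assumptions.
rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 1.0
beta = np.array([1.0, -2.0, 0.5])
x0 = np.array([0.3, 1.0, -0.7])

X = rng.normal(size=(n, p))                      # one fixed design matrix
XtX_inv = np.linalg.inv(X.T @ X)

n_sims = 20_000
sq_errors = np.empty(n_sims)
for s in range(n_sims):
    y = X @ beta + sigma * rng.normal(size=n)    # new training errors, same X
    beta_hat = XtX_inv @ X.T @ y                 # least-squares fit
    y0_hat = x0 @ beta_hat                       # prediction at the fixed x0
    y0 = x0 @ beta + sigma * rng.normal()        # new observation at x0
    sq_errors[s] = (y0 - y0_hat) ** 2

print("Monte Carlo :", sq_errors.mean())
print("Formula (3) :", sigma**2 * (1 + x0 @ XtX_inv @ x0))
```

Repeating the same experiment over many draws of $X$ and averaging then estimates $E_{\cal T}[(X^TX)^{-1}]$ and hence $E(PE\mid x_0)$.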
