I’m trying to understand the following derivation


This is taken from Hastie, Tibshirani and Friedman's Elements of Statistical Learning. They describe the squared error loss and seek to minimize its expectation, which they call the expected (squared) prediction error, $EPE$. They first define the squared error loss as $L(Y, f(X))=(Y-f(X))^{2}$. The criterion for choosing $f$ then becomes, $$EPE(f)=E(Y-f(X))^{2}$$
$$EPE(f)=\int(y-f(x))^{2}P(dx, dy)$$
The authors then condition on $X$ to get, $$EPE(f)=E_{X}E_{Y|X}([Y-f(X)]^{2}|X)$$
I think I might be missing something basic here, but how did the authors arrive at this by conditioning on X?

They then go on to write that it suffices to minimize $EPE$ pointwise, $$f(x) = \operatorname{argmin}_{c}E_{Y|X}([Y-c]^{2}|X=x)$$
How do the authors arrive at this conclusion, i.e., that it suffices to minimize $EPE$ pointwise? This is not very intuitive to me.

Finally, the authors state that the solution to the equation above is, $$f(x)=E(Y|X=x)$$
How do they arrive at this solution? Could someone describe the intermediate math to achieve this as a solution?

Best Answer

In addition to what's already been answered, I hope this derivation helps.

$$\text{EPE}(f)=\int [y-f(x)]^2\text{Pr}(dx,dy)$$ $$=\int [y-f(x)]^2p(x,y)dxdy$$ $$=\int_x\int_y [y-f(x)]^2p(x,y)dxdy$$ $$=\int_x\int_y [y-f(x)]^2p(y|x)p(x)dxdy$$ $$=\int_x\left(\int_y[y-f(x)]^2p(y|x)dy\right)p(x)dx$$ $$=\int_xE_{Y|X}\left([Y-f(X)]^2|X\right)p(x)dx$$ $$=E_XE_{Y|X}\left([Y-f(X)]^2|X\right)$$
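The last line also answers the "pointwise" question: since $p(x)\ge 0$, making the inner conditional expectation as small as possible at every $x$ makes the outer integral as small as possible, so minimizing $EPE$ over functions reduces to minimizing $E_{Y|X}([Y-c]^2|X=x)$ separately for each $x$. Below is a minimal numerical sketch of my own (not from the book), assuming a toy joint distribution with $X$ uniform on three values, $Y = X^2 + \varepsilon$, and an arbitrary candidate predictor $f(x)=2x$; it checks the iterated-expectation identity and shows the conditional mean doing better.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution (an assumption purely for illustration):
# X uniform over three values, Y = X**2 + Gaussian noise with sd 0.5.
x_values = np.array([0.0, 1.0, 2.0])
n = 200_000
X = rng.choice(x_values, size=n)
Y = X**2 + rng.normal(scale=0.5, size=n)

f = lambda x: 2.0 * x  # an arbitrary candidate predictor, not the optimum

# EPE as a single expectation over the joint distribution
epe_joint = np.mean((Y - f(X))**2)

# EPE via iterated expectation E_X E_{Y|X}([Y - f(X)]^2 | X)
p_x = np.array([np.mean(X == v) for v in x_values])                    # P(X = v)
inner = np.array([np.mean((Y[X == v] - f(v))**2) for v in x_values])   # E_{Y|X}(. | X = v)
epe_iterated = np.sum(p_x * inner)

print(epe_joint, epe_iterated)  # agree up to Monte Carlo error

# Plugging in the conditional mean for each x gives a smaller EPE
cond_means = np.array([Y[X == v].mean() for v in x_values])
epe_cond_mean = np.mean((Y - cond_means[np.searchsorted(x_values, X)])**2)
print(epe_cond_mean)  # close to Var(Y | X) = 0.25, below epe_joint
```

The two printed values of $EPE(f)$ agree up to Monte Carlo error, and the conditional-mean predictor's EPE is smaller, consistent with $f(x)=E(Y|X=x)$ being the minimizer.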

Using slightly different notation from that of the book, write $f$ for the value $f(x)$ at the fixed point $x$ (the book's $c$). Because we minimize pointwise, we set the derivative of the conditional expectation with respect to $f$ to zero: $$f(x)=\operatorname{argmin}_{f}E_{Y|X}([Y-f]^2|X=x)$$ $$\frac{\partial}{\partial{f}}\left(E([Y-f]^2|X=x)\right)=0$$ $$\frac{\partial}{\partial{f}}\left(E(Y^2-2Yf+f^2|X=x)\right)=0$$ $$\frac{\partial}{\partial{f}}\left(E(Y^2|X=x)-2fE(Y|X=x)+f^2\right)=0$$ $$0-2E(Y|X=x)+2f=0$$ $$f(x)=f^*=E(Y|X=x)$$
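For completeness (this is not in the book's text, just an alternative route), the same minimizer can be found without calculus by adding and subtracting $E(Y|X=x)$ inside the square:

$$\begin{aligned}
E\left([Y-c]^2 \mid X=x\right)
&= E\left([Y-E(Y|X=x)+E(Y|X=x)-c]^2 \mid X=x\right)\\
&= E\left([Y-E(Y|X=x)]^2 \mid X=x\right) + [E(Y|X=x)-c]^2\\
&= \operatorname{Var}(Y\mid X=x) + [E(Y|X=x)-c]^2,
\end{aligned}$$

where the cross term vanishes because $E\left(Y-E(Y|X=x)\mid X=x\right)=0$. The variance term does not involve $c$, so the expression is minimized by taking $c=E(Y|X=x)$, the conditional expectation.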