Solved – Derivation of EPE for linear regression in “The Elements of Statistical Learning”

machine-learning, regression

My question is about "The Elements of Statistical Learning" (yup, the one).
Right now I am kinda stuck on the part of the second chapter where they derive the EPE for linear regression (somewhat related to Confusion about derivation of regression function, but I have a more detailed question here =)).
Here it is:

[image: ESL's expression $\mathrm{EPE}(x_0) = E_{y_0|x_0} E_T (y_0 - \hat{y}_0)^2$ and its bias-variance decomposition]

Ok, I do not really get where they got it from, but that is not so important, because I've found notes on the book (A Solution Manual and Notes for: The Elements of Statistical Learning).

There the authors try to explain how this formula was derived, but I do not get that either =)

So here is the logic they use (just a part; the rest is not important right now):

[image: the corresponding derivation steps from the solution manual]

There are a lot of things here I do not really get. Let me first explain how I see the variables:

$T$ ($\tau$) – the training set: just all the samples, basically i.i.d., and its distribution would be the likelihood of all the samples

$X$ – input RV

$Y$ – output RV

$x_0$ – the selected input (independent of $T$, just some particular $x$)

$y_0$ – the output generated from $x_0$ (an RV, because it depends on the normal error $\epsilon$)

  1. What is $E_{{y_0}|{x_0}}$, and how does it differ from $E_{Y|X}$?
    I mean, $X$ is an input and $Y$ is an output, which have the quite straightforward relation $Y = \beta X + \epsilon$. So $E_{Y|X}$ is the expectation of the output given the input. And $E_{{y_0}|{x_0}}$ seems pretty much the same to me, though I kinda understand that $y_0$ is some output given that we have chosen $x_0$ (it seems $x_0$ is a constant here, though conditioning on a constant feels weird to me). But maybe it is a broader thing, I don't know.

  2. How is this true: $E_T = E_X E_{Y|X}$?

  3. And the one that blew my head off: $E_T U_1 = U_1 E_T$.
    What does it mean? =) I understand the expectation of a random variable, but what is $U_1 E_T$, and how am I supposed to understand it? It seems like $E_T U_1$ is just a constant given that we've chosen $y_0$ and $x_0$, but I could be totally wrong here.

I have many more questions, but for now I will stop here. I have an OK background in stats and linear algebra, and I understand undergrad-level books (without measure theory, though). But I really can't grasp how the authors use expectations.

Best Answer

I am trying to answer the first question. Suppose $x_0 \in R^p$ and $y_0 \in R$. $T$ is the space of training sets, each a set of pairs $(x_i, y_i)$ such that $y_i = x_i^T\beta + \epsilon_i$. Such a set is exactly the training data, and starting from it a linear estimate of $y$ as a function of $x$ is calculated with the linear regression method.

Let $g: R \to R$ be a function. Then
$$E_{y_0|x_0}\, g = \int_A g(y)\,\rho(y \mid x_0)\,\mathrm{d}y, \qquad A = \{\, y = x_0^T\beta + \epsilon,\ \epsilon \sim N(0, \sigma^2) \,\},$$
where $\rho$ is the conditional probability density function of $y$ given $x = x_0$. In our case $g(y) = E_T(y - \hat{y})^2$.
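A quick numerical sanity check of that nested expectation might help. This is only a sketch with made-up values (a 1-D model through the origin, $\beta = 2$, $\sigma = 1$, $x_0 = 0.5$, 20 training points, ordinary least squares), not anything taken from the book or the solution manual: the outer loop averages over draws of $y_0$ given the fixed $x_0$, and the inner loop averages over fresh training sets $T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy settings (not from the book): 1-D model y = beta*x + eps, no intercept
beta, sigma, x0 = 2.0, 1.0, 0.5
n_train = 20      # size of each training set T
n_T = 200         # number of training sets drawn for the inner E_T
n_y0 = 200        # number of y_0 draws for the outer E_{y_0|x_0}

def fit_and_predict(x0):
    """Draw one training set T, fit least squares through the origin, predict at x0."""
    x = rng.uniform(-1.0, 1.0, n_train)
    y = beta * x + rng.normal(0.0, sigma, n_train)
    beta_hat = (x @ y) / (x @ x)   # OLS slope estimate for this T
    return beta_hat * x0           # \hat{y}_0 for this particular T

epe = 0.0
for _ in range(n_y0):                        # outer expectation E_{y_0|x_0}
    y0 = beta * x0 + rng.normal(0.0, sigma)  # y_0 depends only on x_0 and the error
    y_hat = np.array([fit_and_predict(x0) for _ in range(n_T)])
    epe += np.mean((y0 - y_hat) ** 2)        # inner expectation E_T
epe /= n_y0

# With a correctly specified model the bias term is zero, so the estimate should be
# close to sigma^2 (the irreducible error) plus the small variance of the fitted prediction.
print(f"Monte Carlo EPE(x0) ~= {epe:.3f}")
```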

For the second question: $X$ denotes the ordered set of training inputs in $R^p$ and $Y$ the set of corresponding outputs $y$. The training pairs are drawn from the joint distribution of inputs and outputs, so averaging over the training set $T$ is the same as first averaging over the inputs and then over the outputs conditional on those inputs, which is exactly $E_T = E_X E_{Y|X}$.
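Written out for a single training pair (the law of iterated expectations; the same factorization applies to the whole i.i.d. training set):

$$E_T\, f(X, Y) = \int\!\!\int f(x, y)\, p(x, y)\,\mathrm{d}y\,\mathrm{d}x = \int \Big( \int f(x, y)\, p(y \mid x)\,\mathrm{d}y \Big) p(x)\,\mathrm{d}x = E_X\, E_{Y|X}\, f(X, Y).$$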

For the third question: $E_T U_1 = U_1 E_T$ holds when $U_1$ is a constant with respect to the integration variable in the expectation integral, so it can simply be pulled outside that integral.
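Spelled out: if $U_1$ does not depend on the training set (here, because it is fixed once $x_0$ and $y_0$ are chosen), it behaves like a constant under $E_T$:

$$E_T\,[\,U_1\, h(T)\,] = \int U_1\, h(t)\, p(t)\,\mathrm{d}t = U_1 \int h(t)\, p(t)\,\mathrm{d}t = U_1\, E_T\,[\,h(T)\,].$$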
