Regression – Understanding the Derivation of the Regression Function in Machine Learning

regression, statistical-learning

I just got a copy of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. In Chapter 2 (Overview of Supervised Learning), Section 4 (Statistical Decision Theory), the authors give a derivation of the regression function.

Let $X \in \mathbb{R}^p$ denote a real valued random input vector, and $Y\in\mathbb{R}$ a real valued random output variable, with joint distribution $Pr(X,Y)$. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. This theory requires a loss function $L(Y,f(X))$ for penalizing errors in prediction, and by far the most common and convenient is squared error loss: $L(Y,f(X))=(Y-f(X))^2$. This leads us to a criterion for choosing $f$,

$$\begin{align*} EPE(f) &= E(Y-f(X))^2 \\ &= \int [y - f(x)]^2\,Pr(dx, dy)\end{align*}$$ the expected (squared) prediction error.

I completely understand the setup and motivation. My first confusion is: does he mean $E[(Y-f(X))]^2$ or $E[(Y-f(X))^2]$? Second, I have never seen the notation $Pr(dx,dy)$. Can someone who has explain its meaning to me? Is it just that $Pr(dx) = Pr(x)\,dx$? Alas, my confusion does not end there:

By conditioning on $X$, we can write $EPE$ as $$\begin{align*}EPE(f) = E_XE_{Y|X}([Y-f(X)]^2|X)\end{align*}$$

I am missing the connection between these two steps, and I am not familiar with the technical definition of "conditioning". Let me know if I can clarify anything! I think most of my confusion has arisen from unfamiliar notation; I am confident that, if someone can break this derivation down into plain English, I'll get it. Thanks stats.SE!

Best Answer

For your first confusion: it is the expectation of the squared error, i.e. $E[(Y-f(X))^2]$.
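A quick numeric check makes the difference concrete. Below is a small NumPy sketch (a toy example of my own, not from the book) using a normal stand-in for the error $Y-f(X)$; the two quantities disagree by exactly the variance of the error.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the prediction error Y - f(X): mean 1, standard deviation 2
z = rng.normal(loc=1.0, scale=2.0, size=1_000_000)

print(np.mean(z**2))   # E[Z^2]   ~ 1 + 4 = 5
print(np.mean(z)**2)   # (E[Z])^2 ~ 1
```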

For the notation $Pr(dx,dy)$: it is equal to $g(x,y)\,dx\,dy$, where $g(x,y)$ is the joint pdf of $X$ and $Y$. Similarly, $Pr(dx)=g(x)\,dx$, where $g(x)$ is the marginal pdf of $X$. This can be read as: the probability that $X$ falls in the tiny interval $[x,x+dx]$ equals the pdf value at the point $x$, i.e. $g(x)$, times the interval length $dx$.
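To see the $Pr(dx)=g(x)\,dx$ reading numerically, here is a small SciPy sketch (my own illustration) with $g$ taken to be the standard normal pdf: the exact probability mass in a tiny interval matches $g(x)\,dx$ almost perfectly.

```python
from scipy.stats import norm

x, dx = 1.0, 1e-4
# exact probability that X lands in the tiny interval [x, x + dx]
exact = norm.cdf(x + dx) - norm.cdf(x)
# the differential approximation Pr(dx) = g(x) dx
approx = norm.pdf(x) * dx

print(exact, approx)  # agree to many decimal places
```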

The equation for the EPE stems from the law of iterated expectations, $E(E(Y|X))=E(Y)$, which holds for any two random variables $X$ and $Y$; you can prove it using the conditional distribution. The conditional expectation is the expectation computed under the conditional distribution, and the conditional distribution of $Y|X$ describes the probabilities for $Y$ once the value of $X$ is known.
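You can also convince yourself of $E(E(Y|X))=E(Y)$ by simulation. The Monte Carlo sketch below (a toy model of my own choosing) takes $X\sim\text{Uniform}(0,1)$ and $Y|X\sim N(X^2,1)$, so $E(Y|X)=X^2$ and both estimates should land near $E(X^2)=1/3$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

x = rng.uniform(0.0, 1.0, size=n)      # X ~ Uniform(0, 1)
y = rng.normal(loc=x**2, scale=1.0)    # Y | X ~ N(X^2, 1), so E(Y|X) = X^2

print(np.mean(y))      # direct estimate of E(Y)
print(np.mean(x**2))   # estimate of E(E(Y|X)); both ~ 1/3
```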

In our case, write the squared-error loss as $L(x,y)=(y-f(x))^2$. The EPE then computes

$$\begin{equation}\begin{split}E(L(x,y))&=\int\!\!\int L(x,y)\,g(x,y)\,dx\,dy \\ &=\int\bigg[\int L(x,y)\,g(y|x)\,g(x)\,dy\bigg]dx \\ &=\int\bigg[\int L(x,y)\,g(y|x)\,dy\bigg]g(x)\,dx \\ &=\int\bigg[E_{Y|X}(L(x,y))\bigg]g(x)\,dx \\ &=E_X(E_{Y|X}(L(x,y)))\end{split}\end{equation}$$

The last line is exactly the conditional form of the EPE you quoted. Hope this helps.
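Finally, the two forms of the EPE can be checked against each other numerically. In the sketch below (again a toy model I made up, not the book's), $X\sim N(0,1)$, $Y|X\sim N(X^2,0.25)$, and the predictor is the deliberately imperfect $f(x)=x$; the inner expectation $E_{Y|X}([Y-f(X)]^2\,|\,X)$ is then available in closed form as bias squared plus conditional variance, $(X^2-f(X))^2+0.25$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

x = rng.normal(size=n)                     # X ~ N(0, 1)
y = x**2 + rng.normal(scale=0.5, size=n)   # Y | X ~ N(X^2, 0.25)
f = lambda t: t                            # a deliberately imperfect predictor

# direct double integral: E[(Y - f(X))^2]
direct = np.mean((y - f(x))**2)

# iterated form: average the closed-form inner expectation over X
inner = (x**2 - f(x))**2 + 0.25            # bias^2 + conditional variance
iterated = np.mean(inner)

print(direct, iterated)                    # both estimate the same EPE (~4.25)
```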
