Solved – Loss functions for regression proof

conditional-expectationleast squaresloss-functionsregression

I'm using Bishop's Pattern Recognition and Machine Learning. In Section 1.5.5, loss functions for regression are discussed, in particular the squared loss:

$\mathbb{E}[L] = \displaystyle\int\int \{ y(x)-t\}^{2}p(x,t) dx dt $

The book makes the following remark:

$\{ y(x)-t\}^{2} = \{y(x) - \mathbb{E}[t|x] + \mathbb{E}[t|x] - t \}^{2}
\\ = \{y(x) - \mathbb{E}[t|x] \}^{2} + 2\{y(x) - \mathbb{E}[t|x]\}\{\mathbb{E}[t|x]-t\} + \{\mathbb{E}[t|x]-t\}^{2}$

The resulting expression shown above is substituted into the loss function, integrated over $t$, and then it is seen that the cross-term (the second term) vanishes. The result obtained is:

$\mathbb{E}[L] = \displaystyle\int\int \{ y(x)-t\}^{2}p(x,t) dx dt \\ = \displaystyle\int \{y(x) - \mathbb{E}[t|x] \}^{2} p(x) dx + \displaystyle\int \{\mathbb{E}[t|x]-t\}^{2} p(x) dx$

What I don't understand is the algebra involved in getting the final result. Why does the cross-term vanish? For the last term, how can $t$ appear outside the integral over $t$? Perhaps I am missing something here; could someone explain?

Best Answer

I would like to explain the way I understood it, spelling out each step along the way.

Assumptions:

  1. $g(x,t)$ is a function of $x$ and $t$.
  2. $p(x,t)$ is a joint distribution over $x$ and $t$.

Basic formulas:

$$\mathbb{E}_t[g|x] = \int_t{g(x,t)\,p(t|x)\mathop{dt}} \ (\mathbb{E}_t[g|x] \text{ is a function of $x$ and constant w.r.t. } t) \tag{1}\label{1}$$
$$\mathbb{E}_t[t|x] = \int_t{t\,p(t|x)\mathop{dt}} \tag{2}\label{2}$$
$$\operatorname{var}_t[t|x] = \int_t{(t - \mathbb{E}_t[t|x])^2\,p(t|x)\mathop{dt}} = \mathbb{E}_t[(t - \mathbb{E}_t[t|x])^2 \,|\, x] \tag{3}\label{3}$$
$$ \eqalign{\mathbb{E}_t[f(x)g(x,t)|x] &= \int_t{f(x)g(x,t)\,p(t|x)\mathop{dt}} \\ &= f(x)\int_t{g(x,t)\,p(t|x)\mathop{dt}} \\ &= f(x) \ \mathbb{E}_t[g|x] } \tag{4}\label{4}$$
$$\mathbb{E}_t[f(x)|x] = f(x) \tag{4a}\label{4a}$$
$$\mathbb{E}_{x,t}[g] = \mathbb{E}_x[\mathbb{E}_t[g|x]] \tag{5}\label{5}$$

We derive the last formula above. $$ \eqalign{ \mathbb{E}_{x,t}[g] &= \int_x\int_t g(x,t)\,p(x,t)\mathop{dt}\mathop{dx}\\ &= \int_x\int_t g(x,t)\,p(x)\,p(t|x)\mathop{dt}\mathop{dx}\\ &= \int_x p(x)\int_t g(x,t)\,p(t|x)\mathop{dt}\mathop{dx}\\ &= \int_x \mathbb{E}_t[g|x]\,p(x)\mathop{dx} \text{ (using \ref{1}) } \\ &= \mathbb{E}_x [\mathbb{E}_t[g|x]] \\ } $$
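Formula \ref{5} (the law of iterated expectations) can also be checked numerically. Below is a minimal Monte Carlo sketch, assuming a hypothetical toy joint distribution $x \sim \mathrm{Uniform}(0,1)$, $t\,|\,x \sim \mathcal{N}(x, 0.5^2)$, and $g(x,t) = x\,t$; none of these choices come from the book, they are just simple enough that $\mathbb{E}_t[x\,t\,|\,x] = x^2$ in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

# Hypothetical toy joint: x ~ Uniform(0,1), t | x ~ Normal(x, 0.5^2)
x = rng.uniform(0.0, 1.0, N)
t = x + 0.5 * rng.normal(size=N)

# Direct joint expectation E_{x,t}[g] with g(x,t) = x * t
direct = np.mean(x * t)

# Iterated expectation E_x[E_t[g|x]]: here E_t[x*t | x] = x * E_t[t|x] = x^2
iterated = np.mean(x ** 2)

print(direct, iterated)  # both close to E[x^2] = 1/3
```

Both estimates agree up to Monte Carlo error, illustrating $\mathbb{E}_{x,t}[g] = \mathbb{E}_x[\mathbb{E}_t[g|x]]$.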


Derivation of the expected loss:

Represent the loss function in the form below. Note the subscript $t$ in the $\mathbb{E}_t$ notation: it is omitted in the book, but I add it here for clarity. $$ \eqalign{ L(x,t) &= (y(x)-t)^{2} \\ &= (y(x) - \mathbb{E}_t[t|x])^{2} + 2(y(x) - \mathbb{E}_t[t|x])(\mathbb{E}_t[t|x]-t) + (\mathbb{E}_t[t|x]-t)^{2} \\ &= L_1 + 2L_2 + L_3 } $$

Hence the joint expectation can be represented as:
$$ \eqalign{ \mathbb{E}_{x,t}[L] &= \mathbb{E}_{x,t}[L_1] + 2\mathbb{E}_{x,t}[L_2] + \mathbb{E}_{x,t}[L_3] } $$

We derive the 3 expectations: $$ \eqalign{ \mathbb{E}_{x,t}[L_1] &= \mathbb{E}_{x,t}[(y(x) - \mathbb{E}_t[t|x])^{2}] \\ &= \mathbb{E}_x[ \ \mathbb{E}_t[(y(x) - \mathbb{E}_t[t|x])^{2} | x] \ ] \ \ \text{ (using \ref{5}) } \\ &= \mathbb{E}_x[(y(x) - \mathbb{E}_t[t|x] )^{2}] \text{ (using \ref{4a}, as the operand is a function of $x$ only)} \\ } $$


$$ \eqalign{ \mathbb{E}_{x,t}[L_2] &= \mathbb{E}_{x,t}[(y(x) - \mathbb{E}_t[t|x])(\mathbb{E}_t[t|x]-t)] \\ &= \mathbb{E}_x[ \ \mathbb{E}_t[\{(y(x) - \mathbb{E}_t[t|x])(\mathbb{E}_t[t|x]-t)\} \ | \ x] \ ] \text{ (using \ref{5}) } \\ &= \mathbb{E}_x[ \ (y(x) - \mathbb{E}_t[t|x]) \ \mathbb{E}_t[(\mathbb{E}_t[t|x]-t) | x] \ \ ] \ \text{ (using \ref{4} on the inner expectation)} } $$

Considering only the inner expectation: $$ \eqalign{ \mathbb{E}_t[(\mathbb{E}_t[t|x]-t) | x] &= \mathbb{E}_t[\mathbb{E}_t[t|x] | x] - \mathbb{E}_t[t|x] \text{ (using linearity of $\mathbb{E}$)} \\ &= \mathbb{E}_t[t|x] - \mathbb{E}_t[t|x] \text{ (using \ref{4a} as $\mathbb{E}_t[t|x]$ is a function of $x$)}\\ &= 0 } $$

Therefore, $$ \eqalign{ \mathbb{E}_{x,t}[L_2] &= \mathbb{E}_x[ \ (y(x) - \mathbb{E}_t[t|x]) \ \cdot \ 0 \ ] \\ &= 0 } $$
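The vanishing cross-term can be confirmed numerically as well. This is a sketch under an assumed toy model (not from the book): $x \sim \mathrm{Uniform}(0,1)$, $t\,|\,x \sim \mathcal{N}(\sin 2\pi x,\ 0.3^2)$, so $\mathbb{E}_t[t|x] = \sin 2\pi x$ is known exactly, and $y(x) = 0.5x$ is an arbitrary predictor.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2_000_000

# Assumed toy model: x ~ Uniform(0,1), t | x ~ Normal(sin(2*pi*x), 0.3^2)
x = rng.uniform(0.0, 1.0, N)
cond_mean = np.sin(2 * np.pi * x)          # E_t[t|x], known in closed form
t = cond_mean + 0.3 * rng.normal(size=N)
y = 0.5 * x                                # arbitrary (suboptimal) predictor

# Cross-term L_2: E_{x,t}[(y(x) - E_t[t|x]) (E_t[t|x] - t)]
cross = np.mean((y - cond_mean) * (cond_mean - t))
print(cross)  # close to 0, up to Monte Carlo error
```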


$$ \eqalign{ \mathbb{E}_{x,t}[L_3] &= \mathbb{E}_{x,t}[(\mathbb{E}_t[t|x]-t)^{2}] \\ &= \mathbb{E}_x[ \ \mathbb{E}_t[(\mathbb{E}_t[t|x]-t)^{2} | x] \ ] \text{ (using \ref{5}) } \\ &= \mathbb{E}_x[\operatorname{var}_t[t|x]] \text{ (using \ref{3}) } } $$

Putting them all together and expressing the $\mathbb{E}_x$ terms as integrals under $x$, we get the following form:

$$ \mathbb{E}_{x,t}[L] = \int_x (y(x) - \mathbb{E}_t[t|x])^2 p(x)\mathop{dx} + \int_x \operatorname{var}_t[t|x] p(x) \mathop{dx} $$
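Finally, the full decomposition can be sanity-checked end to end. The sketch below reuses the same assumed toy model ($x \sim \mathrm{Uniform}(0,1)$, $t\,|\,x \sim \mathcal{N}(\sin 2\pi x,\ 0.3^2)$, $y(x) = 0.5x$); since $\operatorname{var}_t[t|x] = 0.3^2$ is constant here, the second integral reduces to that constant.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2_000_000

# Assumed toy model: x ~ Uniform(0,1), t | x ~ Normal(sin(2*pi*x), 0.3^2)
x = rng.uniform(0.0, 1.0, N)
cond_mean = np.sin(2 * np.pi * x)          # E_t[t|x]
t = cond_mean + 0.3 * rng.normal(size=N)
y = 0.5 * x                                # arbitrary (suboptimal) predictor

# Left side: E_{x,t}[(y(x) - t)^2]
lhs = np.mean((y - t) ** 2)

# Right side: E_x[(y(x) - E_t[t|x])^2] + E_x[var_t[t|x]]
# (var_t[t|x] = 0.3^2 is constant in this toy model)
rhs = np.mean((y - cond_mean) ** 2) + 0.3 ** 2

print(lhs, rhs)  # the two sides agree up to Monte Carlo error
```

The first right-hand term depends on the choice of $y$ and is minimized by $y(x) = \mathbb{E}_t[t|x]$; the second is the irreducible noise floor, matching the interpretation in the book.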

Note: As mentioned by @Juho Kokkalla, the erroneous last term in the book is corrected in the errata.