# A Question About (6.15) in Deep Learning (Goodfellow et al.)

Tags: expected value, inner products, normed spaces, optimization, probability distributions

On page 175 of [1], the authors wrote:

Our first result derived using calculus of variations is that solving the optimization problem
$$f^* = \arg\min_f \mathbb{E}_{\mathbf{x},\mathbf{y} \sim p_{\rm data}} \| \mathbf{y} - f(\mathbf{x}) \|^2 \tag{6.14}$$
yields
$$f^*(\mathbf{x}) = \mathbb{E}_{\mathbf{y} \sim p_{\rm data}(\mathbf{y} \mid \mathbf{x})} [ \mathbf{y} ], \tag{6.15}$$
so long as this function lies within the class we optimize over.

One of my students asked me: "How do you derive (6.15) from (6.14)?" The following is my attempt:

From Section 3.8 in [1], we know that the expectation, or expected value, of some function $$f(x)$$ with respect to a probability distribution $$P(\mathbf{x})$$ is the average, or mean value, that $$f$$ takes on when $$x$$ is drawn from $$P$$. For discrete variables this can be computed with a summation:
$$\mathbb{E}_{\mathbf{x} \sim P} [ f(x) ] = \sum_x P(x) f(x), \tag{1}$$
while for continuous variables, it is computed with an integral:
$$\mathbb{E}_{\mathbf{x} \sim p} [ f(x) ] = \int p(x) f(x) dx. \tag{2}$$
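As a quick numerical illustration of (1) (the distribution and the function below are my own made-up choices, not from [1]):

```python
# Illustration of (1): E[f(x)] = sum_x P(x) f(x) for a small
# discrete distribution; values and probabilities are arbitrary.
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]          # a valid PMF: sums to 1
f = lambda x: x ** 2                   # the function whose mean we take

expectation = sum(p * f(x) for x, p in zip(values, probs))
print(expectation)  # 0.1*0 + 0.2*1 + 0.3*4 + 0.4*9 = 5.0
```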
Therefore,
\begin{align} f^* & = \arg\min_f \int \int p_{\mathbf{x},\mathbf{y}}(x,y) \| y - f(x) \|^2 \, dx \, dy \notag \\ & = \arg\min_f \int \int p_{\mathbf{y}|\mathbf{x}}(y|x) \, p_{\mathbf{x}}(x) \| y - f(x) \|^2 \, dx \, dy, \end{align}
where $$p_{\mathbf{x},\mathbf{y}}(x,y)$$ is the joint probability density function (PDF) of random vectors $$\mathbf{x}$$ and $$\mathbf{y}$$, $$p_{\mathbf{y}|\mathbf{x}}(y|x)$$ is the conditional PDF of random vector $$\mathbf{y}$$ given $$\mathbf{x} = x$$, and $$p_{\mathbf{x}}(x)$$ is the marginal PDF of random vector $$\mathbf{x}$$. Furthermore,
\begin{align} \| \mathbf{y} - f(\mathbf{x}) \|^2 & = \langle \mathbf{y} - f(\mathbf{x}), \mathbf{y} - f(\mathbf{x}) \rangle \notag \\ & = \| \mathbf{y} \|^2 - \langle \mathbf{y}, f(\mathbf{x}) \rangle - \langle f(\mathbf{x}), \mathbf{y} \rangle + \| f(\mathbf{x}) \|^2, \tag{3} \end{align}
where $$\langle \mathbf{a}, \mathbf{b} \rangle$$ denotes the inner product of vectors $$\mathbf{a}$$ and $$\mathbf{b}$$. Then I do not know how to continue. Any comments and answers are welcome. Thank you in advance.
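Before asking, I also convinced myself of (6.15) numerically with a small Monte Carlo check (the joint distribution below is an artificial choice of mine: with $$y = 2x + \varepsilon$$ and zero-mean noise, the conditional mean is exactly $$2x$$):

```python
import random

random.seed(0)

# Artificial model (my own choice): x ~ Uniform(0,1), y = 2x + Gaussian
# noise with std 0.3, so the conditional mean E[y|x] is exactly 2x.
n = 100_000
xs = [random.random() for _ in range(n)]
ys = [2 * x + random.gauss(0.0, 0.3) for x in xs]

def mse(f):
    """Empirical version of the objective in (6.14)."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / n

mse_cond_mean = mse(lambda x: 2 * x)        # the candidate from (6.15)
mse_shifted   = mse(lambda x: 2 * x + 0.5)  # any other function should do worse

print(mse_cond_mean, mse_shifted)
```

The conditional mean indeed attains the smaller empirical MSE (close to the noise variance $$0.3^2 = 0.09$$), consistent with the book's claim.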

Reference

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Cambridge: The MIT Press, 2016.

Classical proof: add and subtract the conditional expectation $$\mathbb{E}_{y\sim p(y|x)}[y]$$:
$$\|y-f(x)\|^2=\|(y-\mathbb{E}_{y\sim p(y|x)}[y])+(\mathbb{E}_{y\sim p(y|x)}[y]-f(x))\|^2$$
$$=\|y-\mathbb{E}_{y\sim p(y|x)}[y]\|^2+\|\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\|^2+2\langle y-\mathbb{E}_{y\sim p(y|x)}[y],\,\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\rangle,$$
but when you take the expectation term by term, the last addend disappears:
$$\mathbb{E}_{(x,y^\prime)\sim p}[\langle y^\prime-\mathbb{E}_{y\sim p(y|x)}[y],\,\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\rangle]=\mathbb{E}_{x\sim p}\big[\mathbb{E}_{y^\prime\sim p(y^\prime|x)}[\langle y^\prime-\mathbb{E}_{y\sim p(y|x)}[y],\,\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\rangle]\big]$$
$$=\mathbb{E}_{x\sim p}[\langle\mathbb{E}_{y^\prime\sim p(y^\prime|x)}[y^\prime-\mathbb{E}_{y\sim p(y|x)}[y]],\,\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\rangle]=0,$$
since, conditionally on $$x$$, the second argument of the inner product is constant and $$\mathbb{E}_{y^\prime\sim p(y^\prime|x)}[y^\prime]=\mathbb{E}_{y\sim p(y|x)}[y]$$. Finally,
$$\mathbb{E}_{(x,y)\sim p}[\|y-f(x)\|^2]=\mathbb{E}_{(x,y^\prime)\sim p}[\|y^\prime-\mathbb{E}_{y\sim p(y|x)}[y]\|^2]+\mathbb{E}_{x\sim p}[\|\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\|^2]$$
$$\geq\mathbb{E}_{(x,y^\prime)\sim p}[\|y^\prime-\mathbb{E}_{y\sim p(y|x)}[y]\|^2]=\mathbb{E}_{x\sim p}[\operatorname{var}_{y\sim p(y|x)}(y)].$$
It is therefore clear that the function $$f(x)$$ that minimizes the MSE is the conditional expectation $$\mathbb{E}_{y\sim p(y|x)}[y]$$, and in this case the MSE equals $$\mathbb{E}_{x\sim p}[\operatorname{var}_{y\sim p(y|x)}(y)]$$.
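The decomposition above can also be checked exactly (not just by sampling) on a tiny discrete joint distribution; the joint probabilities and the competitor $$f$$ below are arbitrary choices of mine:

```python
from itertools import product

# A tiny discrete joint distribution p(x, y); probabilities sum to 1.
xs = [0, 1]
ys = [0.0, 1.0, 2.0]
p = {(0, 0.0): 0.2, (0, 1.0): 0.1, (0, 2.0): 0.1,
     (1, 0.0): 0.1, (1, 1.0): 0.2, (1, 2.0): 0.3}

# Marginal p(x) and conditional mean E[y | x].
px = {x: sum(p[(x, y)] for y in ys) for x in xs}
cond_mean = {x: sum(p[(x, y)] * y for y in ys) / px[x] for x in xs}

def f(x):                       # an arbitrary competitor function
    return 0.5 * x

# Left side: E[(y - f(x))^2]; right side: the two terms of the decomposition.
lhs = sum(p[(x, y)] * (y - f(x)) ** 2 for x, y in product(xs, ys))
term1 = sum(p[(x, y)] * (y - cond_mean[x]) ** 2 for x, y in product(xs, ys))
term2 = sum(px[x] * (cond_mean[x] - f(x)) ** 2 for x in xs)

print(lhs, term1 + term2)       # the two agree, and lhs >= term1
```

Here `term1` is $$\mathbb{E}_{x\sim p}[\operatorname{var}_{y\sim p(y|x)}(y)]$$, the irreducible part of the MSE, and `term2` vanishes exactly when $$f$$ equals the conditional mean.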