A Question About (6.15) in Deep Learning (Goodfellow et al.)

Tags: expected value, inner-products, normed-spaces, optimization, probability distributions

On page 175 of [1], the authors wrote:

Our first result derived using calculus of variations is that solving the optimization problem
$$ f^* = \arg\min_f \mathbb{E}_{\mathbf{x},\mathbf{y} \sim p_{\rm data}} \| \mathbf{y} - f(\mathbf{x}) \|^2 \tag{6.14}$$
yields
$$ f^*(\mathbf{x}) = \mathbb{E}_{\mathbf{y} \sim p_{\rm data}(\mathbf{y} \mid \mathbf{x})} [ \mathbf{y} ], \tag{6.15} $$
so long as this function lies within the class we optimize over.

One of my students asked me: "How can (6.15) be derived from (6.14)?" Here is my attempt:

From Section 3.8 in [1], we know that the expectation, or expected value, of some function $f(x)$ with respect to a probability distribution $P(\mathbf{x})$ is the average, or mean value, that $f$ takes on when $x$ is drawn from $P$. For discrete variables this can be computed with a summation:
$$ \mathbb{E}_{\mathbf{x} \sim P} [ f(x) ] = \sum_x P(x) f(x), \tag{1} $$
while for continuous variables it is computed with an integral:
$$ \mathbb{E}_{\mathbf{x} \sim p} [ f(x) ] = \int p(x) f(x) dx. \tag{2} $$
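As a quick sanity check of (1), here is a minimal Python sketch (with a made-up distribution and function of my own choosing) comparing the weighted sum against a Monte Carlo average:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up discrete distribution P over x in {0, 1, 2} and an arbitrary f.
x_vals = np.array([0, 1, 2])
P = np.array([0.2, 0.5, 0.3])   # P(x); sums to 1

def f(x):
    return x ** 2 + 1.0         # arbitrary function of x

# Expectation via the weighted sum in (1).
exact = np.sum(P * f(x_vals))

# Monte Carlo estimate: average f(x) over samples x drawn from P.
samples = rng.choice(x_vals, size=200_000, p=P)
approx = f(samples).mean()

print(exact, approx)  # the two numbers should agree to a few decimal places
```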
Therefore,
\begin{align}
f^*
& = \arg\min_f \int \int p_{\mathbf{x},\mathbf{y}}(x,y) \| \mathbf{y} - f(\mathbf{x}) \|^2 \, dx \, dy \notag \\
& = \arg\min_f \int \int p_{\mathbf{y}|\mathbf{x}}(y|x) \, p_{\mathbf{x}}(x) \| \mathbf{y} - f(\mathbf{x}) \|^2 \, dx \, dy,
\end{align}

where $p_{\mathbf{x},\mathbf{y}}(x,y)$ is the joint probability density function (PDF) of random vectors $\mathbf{x}$ and $\mathbf{y}$, $p_{\mathbf{y}|\mathbf{x}}(y|x)$ is the conditional PDF of random vector $\mathbf{y}$ given $\mathbf{x} = x$, and $p_{\mathbf{x}}(x)$ is the marginal PDF of random vector $\mathbf{x}$. Furthermore,
\begin{align}
\| \mathbf{y} - f(\mathbf{x}) \|^2
& = \langle \mathbf{y} - f(\mathbf{x}), \mathbf{y} - f(\mathbf{x}) \rangle \notag \\
& = \| \mathbf{y} \|^2 - \langle \mathbf{y}, f(\mathbf{x}) \rangle - \langle f(\mathbf{x}), \mathbf{y} \rangle + \| f(\mathbf{x}) \|^2, \tag{3}
\end{align}

where $\langle \mathbf{a}, \mathbf{b} \rangle$ denotes the inner product of vectors $\mathbf{a}$ and $\mathbf{b}$. At this point I do not know how to continue. Any comments and answers are welcome. Thank you in advance.

Reference

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Cambridge: The MIT Press, 2016.

Best Answer

The classical proof: add and subtract the conditional expectation $\mathbb{E}_{y\sim p(y|x)}[y]$:
$$\|y-f(x)\|^2=\|(y-\mathbb{E}_{y\sim p(y|x)}[y])+(\mathbb{E}_{y\sim p(y|x)}[y]-f(x))\|^2$$
$$=\|y-\mathbb{E}_{y\sim p(y|x)}[y]\|^2+\|\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\|^2+2\,\langle y-\mathbb{E}_{y\sim p(y|x)}[y],\;\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\rangle.$$

When you take the expectation term by term, the cross term vanishes, because given $x$ the second factor $\mathbb{E}_{y\sim p(y|x)}[y]-f(x)$ is constant and can be pulled out of the inner conditional expectation:
$$\mathbb{E}_{(x,y')\sim p}\big[\langle y'-\mathbb{E}_{y\sim p(y|x)}[y],\;\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\rangle\big]=\mathbb{E}_{x\sim p}\Big[\mathbb{E}_{y'\sim p(y'|x)}\big[\langle y'-\mathbb{E}_{y\sim p(y|x)}[y],\;\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\rangle\big]\Big]$$
$$=\mathbb{E}_{x\sim p}\big[\langle\mathbb{E}_{y'\sim p(y'|x)}[y'-\mathbb{E}_{y\sim p(y|x)}[y]],\;\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\rangle\big]=0.$$

Therefore
$$\mathbb{E}_{(x,y)\sim p}[\|y-f(x)\|^2]=\mathbb{E}_{(x,y')\sim p}[\|y'-\mathbb{E}_{y\sim p(y|x)}[y]\|^2]+\mathbb{E}_{x\sim p}[\|\mathbb{E}_{y\sim p(y|x)}[y]-f(x)\|^2]$$
$$\geq\mathbb{E}_{(x,y')\sim p}[\|y'-\mathbb{E}_{y\sim p(y|x)}[y]\|^2]=\mathbb{E}_{x\sim p}[\operatorname{var}_{y\sim p(y|x)}(y)].$$

It follows that the function $f(x)$ minimizing the MSE is the conditional expectation $f^*(x)=\mathbb{E}_{y\sim p(y|x)}[y]$, and in that case the MSE equals $\mathbb{E}_{x\sim p}[\operatorname{var}_{y\sim p(y|x)}(y)]$.
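To complement the proof, here is a short numerical sketch (Python, using an assumed toy Gaussian model of my own, not from [1]) illustrating both conclusions: the conditional mean attains the smallest MSE, and that minimum equals $\mathbb{E}_{x\sim p}[\operatorname{var}_{y\sim p(y|x)}(y)]$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: x ~ N(0, 1), y = 2x + eps with eps ~ N(0, 0.5^2),
# so E[y | x] = 2x and var(y | x) = 0.25.
n = 500_000
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)

def mse(pred):
    """Empirical mean squared error of a predictor evaluated at the samples."""
    return np.mean((y - pred) ** 2)

print(mse(2.0 * x))        # conditional mean: ~0.25 = E_x[var(y | x)]
print(mse(1.8 * x))        # a different linear predictor: larger (~0.29)
print(mse(2.0 * x + 0.3))  # a biased predictor: larger (~0.34)
```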
