Solved – Why is random sampling a non-differentiable operation?

autoencoders, backpropagation, gradient descent, machine learning, variational-bayes

This answer states that we cannot back-propagate through a random node. So, in the case of VAEs, you have the reparametrisation trick, which shifts the source of randomness to a separate variable other than $z$ (the latent vector), so that gradients can now flow through $z$. Similarly, this question states that we cannot differentiate a random sampling operation.

Why exactly is this the case? Why is randomness a problem when differentiating and back-propagating? I think this should be made explicit and clear.
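To make the question concrete, here is a minimal sketch (assuming PyTorch; the scalar parameters and the quadratic loss are made-up choices for illustration) of what "backprop cannot flow through a random node" looks like in practice: a raw draw from a distribution is detached from the graph, whereas the reparametrised draw $z = \mu + \sigma\epsilon$ keeps $z$ a differentiable function of $\mu$ and $\sigma$.

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

# Direct sampling: .sample() draws a value but detaches it from the graph,
# so no gradient can flow back to mu or sigma.
z = torch.distributions.Normal(mu, sigma).sample()
print(z.requires_grad)  # False -> backprop stops here

# Reparameterised sampling: z = mu + sigma * eps with eps ~ N(0, 1).
# The randomness lives in eps, so z is a deterministic, differentiable
# function of mu and sigma, and gradients flow through it.
# (torch.distributions.Normal exposes the same idea via .rsample().)
eps = torch.randn(())
z_reparam = mu + sigma * eps
loss = z_reparam ** 2   # some downstream function of z
loss.backward()
print(mu.grad, sigma.grad)  # well-defined gradients
```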

Best Answer

Gregory Gundersen wrote a blog post about this in 2018. He explicitly answers the question:

What does a “random node” mean and what does it mean for backprop to “flow” or not flow through such a node?

The following excerpt should answer your questions:

Undifferentiable expectations

Let’s say we want to take the gradient w.r.t. $\theta$ of the following expectation, $$\mathbb{E}_{p(z)}[f_{\theta}(z)]$$ where $p$ is a density. Provided we can differentiate $f_{\theta}(z)$, we can easily compute the gradient:

$$ \begin{align} \nabla_{\theta} \mathbb{E}_{p(z)}[f_{\theta}(z)] &= \nabla_{\theta} \Big[ \int_{z} p(z) f_{\theta}(z) dz \Big] \\ &= \int_{z} p(z) \Big[\nabla_{\theta} f_{\theta}(z) \Big] dz \\ &= \mathbb{E}_{p(z)} \Big[\nabla_{\theta} f_{\theta}(z) \Big] \end{align} $$
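A quick numerical check of this identity (a sketch in NumPy; the choices $f_{\theta}(z) = (z - \theta)^2$ and $p(z) = \mathcal{N}(0, 1)$ are illustrative assumptions, not from the excerpt): because $p$ does not depend on $\theta$, sampling $z$ from $p$ and averaging $\nabla_{\theta} f_{\theta}(z)$ recovers the analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5

# f_theta(z) = (z - theta)^2, so grad wrt theta is 2 * (theta - z).
z = rng.standard_normal(100_000)       # samples from p(z) = N(0, 1)
mc_grad = np.mean(2.0 * (theta - z))   # expectation of the gradient

# Analytic check: E[(z - theta)^2] = 1 + theta^2, so the gradient is 2 * theta.
print(mc_grad, 2 * theta)              # ~3.0 vs 3.0
```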

In words, the gradient of the expectation is equal to the expectation of the gradient. But what happens if our density $p$ is also parameterized by $\theta$?

$$ \begin{align} \nabla_{\theta} \mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)] &= \nabla_{\theta} \Big[ \int_{z} p_{\theta}(z) f_{\theta}(z) dz \Big] \\ &= \int_{z} \nabla_{\theta} \Big[ p_{\theta}(z) f_{\theta}(z) \Big] dz \\ &= \int_{z} f_{\theta}(z) \nabla_{\theta} p_{\theta}(z) dz + \int_{z} p_{\theta}(z) \nabla_{\theta} f_{\theta}(z) dz \\ &= \underbrace{\int_{z} f_{\theta}(z) \nabla_{\theta} p_{\theta}(z) dz}_{\text{What about this?}} + \mathbb{E}_{p_{\theta}(z)} \Big[\nabla_{\theta} f_{\theta}(z)\Big] \end{align}$$

The first term of the last equation is not guaranteed to be an expectation, so we cannot estimate it by simply drawing samples of $z$ and averaging. Monte Carlo methods require that we can sample from $p_{\theta}(z)$, but not that we can take its gradient. This would not be a problem if we had an analytic expression for $\nabla_{\theta} p_{\theta}(z)$, but that is not true in general.
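The excerpt stops here, but the failure mode (and the fix the question alludes to) can be seen numerically. In this sketch (NumPy; the choices $p_{\theta}(z) = \mathcal{N}(\theta, 1)$ and $f(z) = z^2$ are illustrative assumptions), naively sampling $z$ and then differentiating treats $z$ as a constant, so the estimator is zero and misses the $\nabla_{\theta} p_{\theta}(z)$ term entirely; rewriting $z = \theta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ moves the randomness into $\epsilon$ and recovers the true gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
eps = rng.standard_normal(100_000)
z = theta + eps                      # samples from p_theta(z) = N(theta, 1)

# Target: grad wrt theta of E_{z ~ N(theta, 1)}[z^2].
# Analytically E[z^2] = theta^2 + 1, so the true gradient is 2 * theta = 3.0.

# Naive approach: draw z, then differentiate f(z) = z^2 wrt theta.
# A drawn sample is just a number with no functional dependence on theta,
# so every per-sample "gradient" is 0 and the estimator misses the
# integral involving grad_theta p_theta(z) entirely.
naive_grad = 0.0

# Reparameterised approach: write z = theta + eps with eps ~ N(0, 1).
# Now z is a differentiable function of theta and d/dtheta (z^2) = 2 * z.
reparam_grad = np.mean(2.0 * (theta + eps))

print(naive_grad, reparam_grad)      # 0.0 vs ~3.0 (true gradient)
```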