How does the reparameterization trick for variational autoencoders (VAE) work? Is there an intuitive and easy explanation without simplifying the underlying math? And why do we need the 'trick'?
Solved – How does the reparameterization trick for VAEs work and why is it important
autoencoders, generative-models, mathematical-statistics, variational-bayes
Related Solutions
The answer is "yes" in one sense and "no" in another sense.
Suppose $X \sim \operatorname{Gamma}(\alpha,\beta)$, and let $F_{\alpha,\beta}$ denote the cdf of the $\operatorname{Gamma}(\alpha,\beta)$ distribution. Define $\epsilon = \Phi^{-1}[F_{\alpha,\beta}(X)]$, where $\Phi$ is the standard normal cdf. By the probability integral transform, $\epsilon \sim N(0,1)$; conversely, if you simulate $\epsilon \sim N(0,1)$, you recover the relevant gamma distribution by setting $X = F^{-1}_{\alpha,\beta}[\Phi(\epsilon)]$. Moreover, the transform $T(\epsilon ; \alpha, \beta) = F^{-1}_{\alpha,\beta}[\Phi(\epsilon)]$ is differentiable, so at least in principle you could use this idea with the reparameterization trick to improve your stochastic variational inference. In a liberal sense, then, the answer is "yes, there is a reparameterization trick", and in fact there is one for essentially any family of continuous distributions. If this seems ad hoc, notice that applying the same construction with the Gaussian family in place of the gamma gives back exactly the usual reparameterization trick.
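Here is a minimal numerical sketch of this construction in Python with scipy. It treats $\beta$ as a scale parameter (matching scipy's `scale` and the transform $T(\epsilon; \alpha, \beta) = \beta\epsilon$ discussed below); the specific values are illustrative.

```python
import numpy as np
from scipy import stats

alpha, beta = 2.0, 3.0  # shape and scale (assumed convention)
rng = np.random.default_rng(0)

# Sample epsilon from the fixed base distribution N(0, 1).
eps = rng.standard_normal(100_000)

# Push it through T(eps; alpha, beta) = F^{-1}_{alpha,beta}(Phi(eps)).
x = stats.gamma.ppf(stats.norm.cdf(eps), a=alpha, scale=beta)

# Sanity check: the transformed samples match Gamma(alpha, beta).
print(x.mean(), alpha * beta)    # both approximately 6
print(x.var(), alpha * beta**2)  # both approximately 18
```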
In a more restrictive sense, I would say the answer is "no". The function $F^{-1}_{\alpha,\beta}$ above is not available in closed form, which is inconvenient enough that we might disqualify this approach. Alternatively, there is no reason to restrict ourselves to $\epsilon \sim N(0,1)$: we might just ask for $\epsilon \sim Q$ for some standard distribution $Q$ that is easy to sample from, such that $T(\epsilon; \alpha, \beta) \sim \operatorname{Gamma}(\alpha,\beta)$ where $T$ is also easy to compute.
If you find such a transformation $T$ and standard distribution $Q$, let me know, because I would be interested in it. The main problem is the shape parameter $\alpha$. If $\alpha$ is known, I can take $T(\epsilon; \alpha, \beta) = \beta\epsilon$ with $\epsilon \sim \operatorname{Gamma}(\alpha,1)$, because the gamma family with known $\alpha$ is a scale family (see the sketch below). As far as I know, the shape parameter has no comparably nice algebraic properties, aside from the fact that $X_1 + X_2 \sim \operatorname{Gamma}(\alpha_1 + \alpha_2, 1)$ provided $X_i \sim \operatorname{Gamma}(\alpha_i, 1)$ independently, and it's not clear how to take advantage of this. One piece of negative evidence: if such a convenient $T$ existed, R would probably use it to sample from a generic gamma distribution, but instead it uses rejection sampling.
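A quick numerical illustration of the scale-family point, again treating $\beta$ as a scale parameter (a sketch, not part of the original argument):

```python
import numpy as np

alpha, beta = 2.0, 3.0  # shape and scale
rng = np.random.default_rng(0)

# Draw the "noise" from the standard member Gamma(alpha, 1)...
eps = rng.gamma(shape=alpha, scale=1.0, size=100_000)

# ...and rescale: T(eps; alpha, beta) = beta * eps ~ Gamma(alpha, beta).
x = beta * eps

print(x.mean(), alpha * beta)    # both approximately 6
print(x.var(), alpha * beta**2)  # both approximately 18
```

Note that this only reparameterizes $\beta$: the noise $\epsilon$ itself still depends on $\alpha$, which is exactly the problem described above.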
After $\epsilon$ is sampled, it is completely known; we can treat it the same way as any other data (image, text, feature vector) that's input to a neural network. Just like your input data, $\epsilon$ is known and won't change after you sample it.
This means that the expression $z = \mu + \sigma \odot \epsilon$ has no random components after sampling: you know $\mu$ and $\sigma$ because you obtained them from the encoder, and you know $\epsilon$ because you sampled it. Since every element of the expression is now known and fixed, you can backpropagate through $\mu + \sigma \odot \epsilon$ with respect to $\mu$ and $\sigma$.
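A minimal PyTorch sketch of this point (the values and the loss are placeholders, not from the original answer):

```python
import torch

# Encoder outputs for a single 3-dimensional latent (toy values).
mu = torch.tensor([0.0, 1.0, -1.0], requires_grad=True)
sigma = torch.tensor([1.0, 0.5, 2.0], requires_grad=True)

# Once sampled, eps is just a fixed tensor, like any other input.
eps = torch.randn_like(mu)

# z is a deterministic, differentiable function of mu and sigma.
z = mu + sigma * eps

# Backprop through a placeholder loss; gradients reach mu and sigma.
loss = (z ** 2).sum()
loss.backward()
print(mu.grad)     # 2 * z
print(sigma.grad)  # 2 * z * eps
```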
By contrast, the expression $z \sim \mathcal{N}(\mu,\sigma^2)$ is not a deterministic function of $\mu$ and $\sigma$, so you cannot write a backprop expression for it with respect to $\mu$ and $\sigma$: even with $\mu$ and $\sigma$ held fixed, the output can be any real number.
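The same distinction shows up in `torch.distributions`, where `sample()` cuts the graph while `rsample()` applies the reparameterization above:

```python
import torch
from torch.distributions import Normal

mu = torch.tensor([0.0], requires_grad=True)
sigma = torch.tensor([1.0], requires_grad=True)
dist = Normal(mu, sigma)

# sample() draws under no_grad: the result is detached from mu, sigma.
z = dist.sample()
print(z.requires_grad)  # False -- no path for gradients

# rsample() computes mu + sigma * eps internally, so gradients flow.
z = dist.rsample()
print(z.requires_grad)  # True
z.sum().backward()
print(mu.grad, sigma.grad)
```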
Best Answer
After reading through Kingma's NIPS 2015 workshop slides, I realized that we need the reparameterization trick in order to backpropagate through a random node.
Intuitively, in its original form, a VAE samples the latent $z$ from a random node governed by $q_\phi(z \mid x)$, a parametric model that approximates the true posterior. Backprop cannot flow through a random node.
Introducing a new, parameter-free random variable $\epsilon$ lets us reparameterize $z$ as a deterministic function of $\mu$, $\sigma$, and $\epsilon$, so that backprop can flow through the now-deterministic nodes, as in the sketch below.
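For concreteness, here is a minimal sketch of how this looks inside a VAE's forward pass. The layer sizes and names are illustrative, not taken from Kingma's slides:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy encoder mapping x to the parameters of q(z | x)."""

    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)  # log sigma^2, for stability

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I) carrying all the randomness.
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

# Gradients of any downstream loss reach the encoder through mu and logvar.
enc = Encoder()
x = torch.randn(8, 784)
mu, logvar = enc(x)
z = reparameterize(mu, logvar)
z.sum().backward()
```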