Variational Autoencoders – Assumptions and Considerations in Variational Inference

autoencoders, variational inference, variational-bayes

I am currently reading the paper "Importance Weighted Autoencoders" and am having a hard time understanding something regarding the original Variational Autoencoder (VAE) as described here

In the first paragraph of the third subsection the author wrote this:

> The VAE objective of Eqn. 3 heavily penalizes approximate posterior samples which fail to explain the observations. This places a strong constraint on the model, since the variational assumptions must be approximately satisfied in order to achieve a good lower bound. In particular, the posterior distribution must be approximately factorial and predictable with a feed-forward neural network. This VAE criterion may be too strict; a recognition network which places only a small fraction (e.g. 20%) of its samples in the region of high posterior probability region may still be sufficient for performing accurate inference.

Two parts of this are not clear to me. First, where in the VAE algorithm do we constrain the posterior distribution to be approximately factorial, and why is that constraint bad?
Second, how does the new IWAE objective function relate to the last sentence of the paragraph (the one with the 20%)?

Thanks in advance!

Best Answer

> The VAE objective of Eqn. 3 heavily penalizes approximate posterior samples which fail to explain the observations.

Here's one possible interpretation. The variational lower bound on the log marginal likelihood (equation 3) can be rewritten as:

$$\log p(x) \ge E_{q(h \mid x)} \Big[ \log p(x \mid h) \Big] - D_{KL} \Big( q(h \mid x) \ \big\| \ p(h) \Big)$$

where $x$ contains observed variables, $h$ contains latent variables, $p(x \mid h)$ is the likelihood, $p(h)$ is the prior, and $q(h \mid x)$ is the variational posterior. The first term on the RHS is the expected log likelihood of the data, given latent variables drawn from the variational posterior. This term will be large if the variational posterior assigns high density to latent variable values that are likely to give rise to the data, that is, if the variational posterior and likelihood together "explain" the data well. Otherwise this term, and therefore the variational lower bound, will be small.
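To make the two terms concrete, here is a minimal sketch on a toy model that I'm assuming purely for illustration (it isn't from either paper): $p(h) = \mathcal{N}(0, 1)$ and $p(x \mid h) = \mathcal{N}(h, 1)$, so that $p(x) = \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The function name `elbo_terms` is made up. The expected log likelihood is estimated by Monte Carlo and the KL term uses the closed form for two univariate Gaussians.

```python
import math
import random


def log_normal(z, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at z."""
    return -0.5 * (math.log(2 * math.pi * var) + (z - mean) ** 2 / var)


def elbo_terms(x, q_mean, q_var, n_samples=100_000, seed=0):
    """Return (E_q[log p(x|h)], KL(q || prior)) for q(h|x) = N(q_mean, q_var).

    Assumed toy model: p(h) = N(0, 1), p(x|h) = N(h, 1).
    """
    rng = random.Random(seed)
    q_std = math.sqrt(q_var)
    total = 0.0
    for _ in range(n_samples):
        h = rng.gauss(q_mean, q_std)      # sample from the variational posterior
        total += log_normal(x, h, 1.0)    # log p(x | h)
    expected_loglik = total / n_samples
    # Closed-form KL( N(q_mean, q_var) || N(0, 1) )
    kl = 0.5 * (q_var + q_mean ** 2 - 1.0 - math.log(q_var))
    return expected_loglik, kl


x = 1.5
log_px = log_normal(x, 0.0, 2.0)  # exact log marginal likelihood for the toy model

# q equal to the true posterior N(x/2, 1/2): samples explain x well
ll_good, kl_good = elbo_terms(x, q_mean=x / 2, q_var=0.5)
# q centred far from the true posterior: samples fail to explain x
ll_bad, kl_bad = elbo_terms(x, q_mean=-2.0, q_var=0.5)

elbo_good = ll_good - kl_good
elbo_bad = ll_bad - kl_bad
print(f"log p(x) = {log_px:.3f}")
print(f"ELBO with q = true posterior: {elbo_good:.3f}")
print(f"ELBO with badly placed q:     {elbo_bad:.3f}")
```

When $q$ equals the true posterior, the bound matches $\log p(x)$ up to Monte Carlo error. When $q$ puts its mass where the likelihood of $x$ is low, the first term drops sharply and drags the bound down with it.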

> In particular, the posterior distribution must be approximately factorial and predictable with a feed-forward neural network.

This could be rephrased as: A VAE seeks to approximate the true posterior with a factorized Gaussian distribution whose parameters are given by a neural net.

For a fixed generative model, maximizing the variational lower bound with respect to $q$ is also equivalent to minimizing $D_{KL} \Big( q(h \mid x) \ \big\| \ p(h \mid x) \Big)$, the KL divergence between the approximate and true posteriors, because the gap between $\log p(x)$ and the bound is exactly this KL divergence. That is, we're searching for an approximate posterior that matches the true one as well as possible (as measured by KL divergence).
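This equivalence follows from the standard decomposition of the log marginal likelihood, written here with the same symbols as above (a short derivation sketch, not quoted from either paper):

$$\begin{aligned}
\log p(x) &= E_{q(h \mid x)} \Big[ \log \frac{p(x, h)}{p(h \mid x)} \Big] \\
&= E_{q(h \mid x)} \Big[ \log \frac{p(x, h)}{q(h \mid x)} \Big] + E_{q(h \mid x)} \Big[ \log \frac{q(h \mid x)}{p(h \mid x)} \Big] \\
&= \underbrace{E_{q(h \mid x)} \Big[ \log p(x \mid h) \Big] - D_{KL} \Big( q(h \mid x) \ \big\| \ p(h) \Big)}_{\text{variational lower bound}} + D_{KL} \Big( q(h \mid x) \ \big\| \ p(h \mid x) \Big)
\end{aligned}$$

The left-hand side doesn't depend on $q$, so any increase in the bound obtained by changing $q$ is matched by an equal decrease in the last KL term. Since that KL term is non-negative, dropping it recovers the inequality in equation 3.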

The original VAE paper (Kingma and Welling 2013) defines the variational posterior as a Gaussian distribution with diagonal covariance matrix (see equation 12 in appendix C.2):

$$q(h \mid x) = \mathcal{N} \Big( h \ \big| \ \mu(x), \sigma(x)^2 I \Big)$$

The conditional mean $\mu(x)$ and variance $\sigma(x)^2$ are outputs of a neural net that takes the observed variable $x$ as input. Note that $\sigma(x)^2$ is a vector with one entry per latent variable, so $\sigma(x)^2 I$ is shorthand for the diagonal matrix $\text{diag}(\sigma(x)^2)$ rather than an isotropic covariance. Since the covariance matrix is diagonal, the multivariate Gaussian distribution above factorizes as a product of univariate Gaussian distributions (one for each latent variable), each with its own conditional mean and variance.
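The factorization can be checked numerically. The sketch below uses two latent variables with arbitrary values I picked for illustration. With a diagonal covariance, the joint log density equals the sum of the univariate log densities. Once an off-diagonal term is added, this no longer holds, which is exactly the kind of posterior a factorized $q$ cannot represent.

```python
import math


def log_normal(z, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at z."""
    return -0.5 * (math.log(2 * math.pi * var) + (z - mean) ** 2 / var)


def log_mvn_2d(z, mean, cov):
    """Log density of a 2-D Gaussian with covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d0, d1 = z[0] - mean[0], z[1] - mean[1]
    # Quadratic form d^T cov^{-1} d, using the closed-form 2x2 inverse
    quad = (c * d0 * d0 - 2 * b * d0 * d1 + a * d1 * d1) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


# Illustrative values (assumptions, not from the paper)
h = (0.3, -1.2)          # a latent sample
mu = (0.5, -1.0)         # stands in for the encoder output mu(x)
var = (0.4, 2.0)         # stands in for the encoder output sigma(x)^2

sum_univariate = log_normal(h[0], mu[0], var[0]) + log_normal(h[1], mu[1], var[1])
joint_diagonal = log_mvn_2d(h, mu, [[var[0], 0.0], [0.0, var[1]]])
joint_correlated = log_mvn_2d(h, mu, [[var[0], 0.6], [0.6, var[1]]])

print("sum of univariate log densities:", sum_univariate)
print("joint, diagonal covariance:     ", joint_diagonal)
print("joint, correlated covariance:   ", joint_correlated)
```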

> why is that constraint bad?

It's not necessarily bad. If a factorized Gaussian is a satisfactory approximation of the true posterior then there's no problem at all. But the approximation will be poor if the true posterior has a more complex shape, for example strong correlations between latent variables or multiple modes. Also, recall that we're jointly searching for a good likelihood and variational posterior. If the variational posterior is insufficient, it's possible the likelihood could be dragged into a bad regime in order to let the variational posterior fit slightly better.
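As a rough illustration of how the mismatch grows, suppose (my assumption, for a toy case only) the true posterior is a zero-mean 2-D Gaussian with unit variances and correlation $\rho$. The factorized Gaussian that minimizes $D_{KL}(q \,\|\, p)$ matches the diagonal of the precision matrix, which is the standard mean-field result for a Gaussian target. The sketch computes the remaining KL divergence, which works out to $-\tfrac{1}{2} \log(1 - \rho^2)$.

```python
import math


def kl_factorized_vs_correlated(rho):
    """KL(q || p) for p = N(0, [[1, rho], [rho, 1]]) and the best factorized Gaussian q."""
    det_p = 1.0 - rho ** 2
    # The precision matrix of p is [[1, -rho], [-rho, 1]] / det_p
    prec_diag = 1.0 / det_p
    # The optimal mean-field variance is the inverse of the precision diagonal
    q_var = 1.0 / prec_diag
    trace_term = 2 * prec_diag * q_var                      # tr(Sigma_p^{-1} Sigma_q)
    log_det_ratio = math.log(det_p) - 2 * math.log(q_var)   # log(det Sigma_p / det Sigma_q)
    return 0.5 * (trace_term - 2 + log_det_ratio)


for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho = {rho:4.2f}  KL(q || p) = {kl_factorized_vs_correlated(rho):.3f}")
```

With no correlation the factorized approximation is exact, and the divergence grows without bound as $\rho$ approaches 1. This is only the Gaussian case, and a multimodal true posterior would be a different kind of failure.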

That said, IWAEs use the exact same form for the approximate posterior. The authors claim in various places that IWAEs give "increased flexibility" over VAEs but I don't see any detailed explanations of how/why this should be the case.

> This VAE criterion may be too strict; a recognition network which places only a small fraction (e.g. 20%) of its samples in the region of high posterior probability region may still be sufficient for performing accurate inference.

I find this statement puzzling. If an approximate posterior places only a small fraction of samples in a region of high posterior density (according to the true posterior), it sounds like a poor approximation of the true posterior. How would that be considered "accurate inference"? Maybe they're referring to learning a better likelihood, despite a poor match between the approximate and true posteriors. But, as above, they don't really seem to elaborate.
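For reference, the IWAE objective with $k$ samples is $E \big[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i \mid x)} \big]$ with $h_i$ drawn independently from $q(h \mid x)$, and $k = 1$ gives back the VAE bound. Below is a minimal sketch of that bound on a toy model I'm assuming for illustration ($p(h) = \mathcal{N}(0, 1)$, $p(x \mid h) = \mathcal{N}(h, 1)$), with $q$ deliberately set to the prior so that it is a poor match for the true posterior $\mathcal{N}(x/2, 1/2)$. The function name `iwae_bound` is made up. It doesn't settle what the authors meant, but it shows the $k$-sample bound getting close to $\log p(x)$ even though $q$ itself is unchanged.

```python
import math
import random


def log_normal(z, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at z."""
    return -0.5 * (math.log(2 * math.pi * var) + (z - mean) ** 2 / var)


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def iwae_bound(x, q_mean, q_var, k, n_reps=4000, seed=0):
    """Monte Carlo estimate of the k-sample importance weighted bound.

    Assumed toy model: p(h) = N(0, 1), p(x|h) = N(h, 1), q(h|x) = N(q_mean, q_var).
    """
    rng = random.Random(seed)
    q_std = math.sqrt(q_var)
    total = 0.0
    for _ in range(n_reps):
        log_w = []
        for _ in range(k):
            h = rng.gauss(q_mean, q_std)
            log_joint = log_normal(h, 0.0, 1.0) + log_normal(x, h, 1.0)
            log_w.append(log_joint - log_normal(h, q_mean, q_var))  # log importance weight
        total += log_mean_exp(log_w)
    return total / n_reps


x = 1.5
log_px = log_normal(x, 0.0, 2.0)  # exact log marginal likelihood for the toy model

# q is the prior N(0, 1), a deliberately poor match for the true posterior
bounds = {k: iwae_bound(x, q_mean=0.0, q_var=1.0, k=k) for k in (1, 5, 50)}
for k, value in bounds.items():
    print(f"k = {k:2d}  bound = {value:.3f}")
print(f"log p(x)     = {log_px:.3f}")
```

One plausible reading of the 20% remark, then, is that the average of importance weights is dominated by the samples that do land in the high-posterior region, so the objective no longer punishes every stray sample the way the $k = 1$ bound does. That is my interpretation rather than something the quoted paragraph spells out.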