Gaussian VAE – Is the Optimization of the Gaussian VAE Well-Posed?

Tags: deep-learning, generative-models, machine-learning, variational-bayes

In a Variational Autoencoder (VAE), given data $x$ and latent variables $t$ with prior distribution $p(t) = \mathcal{N}(t \mid 0, I)$, the encoder learns a distribution $q_{\varphi}(t)$ that approximates the true posterior $p(t \mid x)$, and the decoder learns a distribution $p_\theta(x\mid t)$ that approximates the true underlying distribution $p^*(x\mid t)$.

The two models are then trained jointly to maximize an objective $L(\varphi, \theta)$, which is a lower bound on the log-likelihood of the training set:

$$L(\varphi, \theta) = \sum_i \mathbb{E}_{q_\varphi} \log \frac{p_\theta(x_i\mid t) p(t)}{q_\varphi(t)} \leq \sum_i \log \int p_\theta(x_i\mid t)p(t) \, dt$$

According to Section C.2 of the original paper by Kingma and Welling (https://arxiv.org/pdf/1312.6114.pdf), when we model $p_{\theta}(x \mid t)$ as a family of Gaussians, the decoder should output both the mean $\mu(t)$ and the (diagonal) covariance $\sigma^2(t) I$ of the Gaussian distribution.

My question is: isn't this optimization problem ill-posed (just like maximum likelihood training in GMMs)? Since the decoder has an output for the variance (or log-variance, as is most common), if it can produce a perfect reconstruction for a single image in the training set (i.e. $\mu(t_i)=x_i$), then it can set the corresponding variance $\sigma^2(t_i)$ arbitrarily close to zero. The likelihood then goes to infinity, regardless of what happens with the remaining training examples.
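A toy illustration of this argument, assuming an isotropic Gaussian decoder and looking only at the decoder log-density at fixed latent codes (it ignores the expectation over $q_\varphi$ and the KL term, so it is a sketch of the intuition rather than the full objective). The data points and the helper names `gaussian_log_density` and `total_log_likelihood` are hypothetical:

```python
import math

def gaussian_log_density(x, mu, sigma2):
    """Log density of an isotropic Gaussian N(x | mu, sigma2 * I)."""
    d = len(x)
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * d * math.log(2 * math.pi * sigma2) - 0.5 * sq_err / sigma2

# Hypothetical toy data: one point is reconstructed perfectly,
# the other is reconstructed poorly with a fixed unit variance.
x_perfect = [0.3, 0.7]
x_other = [0.9, 0.1]
mu_other = [0.5, 0.5]

def total_log_likelihood(sigma2_perfect):
    # mu == x for the perfectly reconstructed point, so only the
    # normalizing term -d/2 * log(2 * pi * sigma2) remains for it.
    ll_perfect = gaussian_log_density(x_perfect, x_perfect, sigma2_perfect)
    ll_other = gaussian_log_density(x_other, mu_other, 1.0)
    return ll_perfect + ll_other

# Shrink the variance of the perfectly reconstructed point.
values = [total_log_likelihood(s2) for s2 in (1.0, 1e-2, 1e-4, 1e-8)]
for v in values:
    print(v)
```

The total keeps growing as the variance of the perfectly reconstructed point shrinks, even though the fit to the other point never improves.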

I know that most Gaussian VAE implementations use a simplified decoder that outputs only the mean, replacing the term $\mathbb{E}_{q_\varphi} \log p_\theta (x_i\mid t)$ by the (negative) squared error between the original image and the reconstruction. Up to an additive constant and a scale factor, this is equivalent to fixing the covariance to the identity matrix. Is this because of the ill-posedness of the original formulation?
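To spell out the link between the squared error and an identity covariance, here is the Gaussian log-density with $\sigma^2(t) = 1$, where $d$ denotes the dimension of $x_i$ (a symbol introduced here for illustration):

$$\log \mathcal{N}(x_i \mid \mu(t), I) = -\frac{1}{2} \lVert x_i - \mu(t) \rVert^2 - \frac{d}{2} \log (2\pi)$$

The second term does not depend on the parameters, so maximizing this log-density over $\theta$ is the same as minimizing the squared reconstruction error. Because the variance is fixed, the normalizing term can no longer be driven to infinity.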

Best Answer

I co-wrote a paper on this exact problem:

https://papers.nips.cc/paper/7642-leveraging-the-exact-likelihood-of-deep-latent-variable-models

We show that, as you thought, maximum likelihood estimation is ill-posed for Gaussian-output VAEs. The situation is much the same as for GMMs. A solution is to constrain the eigenvalues of the covariance produced by the covariance network to be larger than some threshold.
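One possible way to impose such a lower bound for a diagonal covariance (whose eigenvalues are simply its diagonal entries) is to add a fixed floor to a softplus-transformed network output. This is a minimal sketch under that assumption, not necessarily the parameterization used in the paper; `floored_variance`, `min_variance`, and `log_density_upper_bound` are hypothetical names:

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z)); always non-negative."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def floored_variance(raw, min_variance=1e-3):
    """Map an unconstrained decoder output to a variance >= min_variance."""
    return min_variance + softplus(raw)

def gaussian_log_density(x, mu, sigma2):
    """Log density of an isotropic Gaussian N(x | mu, sigma2 * I)."""
    d = len(x)
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * d * math.log(2 * math.pi * sigma2) - 0.5 * sq_err / sigma2

def log_density_upper_bound(d, min_variance=1e-3):
    """Largest log density reachable once the variance is floored."""
    return -0.5 * d * math.log(2 * math.pi * min_variance)

# Even with a perfect reconstruction and a very negative raw output,
# the log density cannot exceed the bound.
x = [0.3, 0.7]
sigma2 = floored_variance(-1000.0)
best_case = gaussian_log_density(x, x, sigma2)
bound = log_density_upper_bound(len(x))
print(best_case, bound)
```

With the floor in place, each data point contributes at most `log_density_upper_bound(d)` to the decoder log-density, so shrinking the variance on one training example can no longer make the likelihood diverge.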

An interesting remark is that, for discrete data, the problem is well-posed. This possibly explains why VAEs are usually benchmarked on discrete data sets (like binary MNIST).

We show all these results in Section 2.1 of our paper.

Similar investigations were also conducted in this paper:

http://www.jmlr.org/papers/volume19/17-704/17-704.pdf

They show (Theorem 5) that the VAE objective is unbounded. This means that, in general, even the presence of the KL term does not make the objective well-posed.
