Machine Learning – How Many Normal Distributions for Posterior in Variational Autoencoders?

autoencoders, machine-learning, neural-networks, variational-bayes

I am currently reading about variational autoencoders.

Some of the papers I've read are:

The papers differ in one fundamental respect: Doersch has only one layer that produces the mean and standard deviation of a normal distribution, and it is located in the encoder. The others have two such layers: one in exactly the same position in the encoder as in Doersch, and a second as the last layer, just before the reconstructed value.

I have seen the variant with only one such layer in some machine learning books as well.

So which of the two methods is used more often? Is one superior to the other? What is the rationale for using one versus two such layers? (A good schematic can be found in the last paper, on page 3.)

As requested I am adding a graphic:
[Figure: VAE with two sampling steps]

This graphic is from the last paper I linked. It shows the typical latent-variable layer, produced by sampling from a normal distribution whose parameters $\mu_z$ and $\sigma_z$ are generated by the neural network. The first paper I mentioned uses such a sampling layer only in the encoder, meaning that in the right part of the graphic, the decoder, there is no sampling step and no $\mu_x$ and $\sigma_x$; instead, the input is reconstructed by an ordinary feedforward network.

So, to repeat my question: what are the advantages and disadvantages of the two approaches?

Best Answer

A VAE defines a distribution $P(x) = \int P(x|z)P(z)\,dz$, where $P(x|z)$ is often a normal distribution $\mathcal{N}(\mu, \sigma^2)$ whose mean and/or variance are parameterized by a neural network (called the "decoder").
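To make the integral concrete, here is a minimal 1-D sketch that estimates $P(x)$ by Monte Carlo sampling from the prior. The linear "decoder" $\mu = 0.5z$ with unit variance is an illustrative assumption, not something from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_x_given_z(x, z):
    # Toy 1-D "decoder": the mean is a fixed linear map of z, variance is 1.
    mu = 0.5 * z
    return -0.5 * ((x - mu) ** 2 + np.log(2 * np.pi))

# P(x) = E_{z ~ P(z)}[P(x|z)] with prior P(z) = N(0, 1),
# estimated by averaging P(x|z) over samples from the prior.
x = 1.0
z = rng.standard_normal(100_000)
p_x = np.mean(np.exp(log_p_x_given_z(x, z)))
```

For this linear-Gaussian toy model the marginal is available in closed form ($x \sim \mathcal{N}(0, 1.25)$), so the estimate can be checked against it; for a real nonlinear decoder the integral is intractable, which is why VAEs optimize the ELBO with an encoder instead.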

In some cases we assume a fixed variance $\sigma^2$; in others we try to model the variance as well. Of course, we always have to output $\mu$.
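The two variants can be sketched as two output heads on the decoder. The tiny dimensions and random weights below are made up for illustration; a real decoder would of course have hidden layers:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 3
W_mu = rng.standard_normal((latent_dim, data_dim))
W_logvar = rng.standard_normal((latent_dim, data_dim))

def decode_fixed_variance(z):
    # One output head: the mean. sigma^2 is a fixed hyperparameter.
    return z @ W_mu, np.full(data_dim, 1.0)

def decode_learned_variance(z):
    # Two output heads: a mean and a log-variance.
    # Predicting log sigma^2 keeps the variance positive.
    return z @ W_mu, np.exp(z @ W_logvar)

z = rng.standard_normal(latent_dim)
mu, var = decode_learned_variance(z)
```

The log-variance parameterization is the usual trick for keeping $\sigma^2 > 0$ without constrained optimization.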

I want to point out that even in the cases where you don't have two such layers, you are still outputting $\mu$. Training with an MSE loss is equivalent to maximizing the log-likelihood under a Gaussian with mean $\mu$ (where $\sigma^2$ is implicitly fixed at some value that depends on the relative scaling of the MSE term and the KL term).
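This equivalence is easy to verify numerically; the values below are arbitrary:

```python
import numpy as np

x = np.array([0.2, -1.0, 0.7])   # data
mu = np.array([0.0, -0.8, 1.0])  # decoder mean
sigma2 = 0.5                     # some fixed variance

mse = np.mean((x - mu) ** 2)
# Gaussian negative log-likelihood, averaged over dimensions:
nll = np.mean(0.5 * (x - mu) ** 2 / sigma2 + 0.5 * np.log(2 * np.pi * sigma2))
# nll is the MSE rescaled by 1/(2 sigma^2) plus a constant that does not
# depend on mu, so minimizing one minimizes the other.
constant = 0.5 * np.log(2 * np.pi * sigma2)
```

Since the constant and the $1/(2\sigma^2)$ factor don't depend on the network's output, the argmin over $\mu$ is identical for both losses.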

What are the advantages/disadvantages of the two approaches?

The advantages of modeling $\sigma^2$ are a slightly more expressive model and the ability to express that the model is more uncertain about some dimensions than others.

The disadvantages are that it is slightly more work to code, and that in most cases people only use the mean of the distribution anyway, so there is often no reason to bother. You also have to worry about a degenerate optimum: the model can overfit a small dataset by memorizing $\mu$ perfectly and then driving $\sigma^2 \rightarrow 0$, which sends the loss to negative infinity.
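The degenerate optimum is easy to see numerically: once the predicted mean matches a training point exactly, shrinking the predicted variance alone drives the negative log-likelihood toward $-\infty$ (toy values below):

```python
import numpy as np

x = np.array([0.3, -0.5])  # a memorized training point
mu = x.copy()              # the decoder reproduces it perfectly
nlls = []
for sigma2 in [1.0, 1e-2, 1e-8]:
    # With mu == x the squared-error term vanishes, leaving only the
    # log-variance term, which tends to -inf as sigma2 -> 0.
    nll = np.sum(0.5 * (x - mu) ** 2 / sigma2
                 + 0.5 * np.log(2 * np.pi * sigma2))
    nlls.append(float(nll))
```

In practice this is usually handled by lower-bounding the predicted variance or by regularization, since nothing in the likelihood itself stops the collapse.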