Machine Learning – Is the Output of a Variational Autoencoder a Distribution or a Sample?

autoencoders, machine-learning, variational-bayes

It is difficult to ask this question succinctly in the title, so let me explain.

From all the examples of VAEs I have seen, there seem to be 2 approaches used to implement them.

  1. In the first approach, the output of the decoder is a
    distribution, specifically a Bernoulli distribution. The loss is
    related to the likelihood of the input data with respect to this
    distribution. When generating new data, samples are drawn not only
    from the latent space distributions but also from the output
    distributions. I have a question related to the parameterization of
    the output distributions using this approach, but I will save that
    for the end.

  2. In the second approach, the output of the decoder is the actual
    predicted data points, which are then compared to the input to
    calculate the loss as the mean squared error. The only thing that
    ever gets sampled is the latent space.

These 2 approaches are fundamentally different. I notice that in the first approach, the authors relate the idea to a statistical/probabilistic motivation, and it seems to have a solid theoretical foundation. For this reason, I am inclined to adopt the first approach.
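To make the contrast concrete, here is a minimal sketch of the two reconstruction losses as I understand them (PyTorch; `x` and `decoder_out` are illustrative stand-ins, not from any particular paper):

```python
import torch
import torch.nn.functional as F

x = torch.rand(8, 784)            # a batch of flattened inputs in [0, 1]
decoder_out = torch.rand(8, 784)  # stand-in for decoder(z), values in (0, 1)

# Approach 1: the decoder output parameterizes a Bernoulli distribution;
# the loss is its negative log-likelihood (binary cross-entropy). To
# generate, one would also sample from it: torch.bernoulli(decoder_out).
loss_bernoulli = F.binary_cross_entropy(decoder_out, x)  # mean over elements

# Approach 2: the decoder output *is* the reconstruction; the loss is the
# mean squared error, and only the latent z is ever sampled.
loss_mse = F.mse_loss(decoder_out, x)                    # mean over elements
```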

So my questions are:

  1. Is either of these approaches considered superior?
  2. The second question is more of a sense check: a Bernoulli
    distribution makes sense for greyscale pixels. But for something
    like stock returns, we would want some other distribution, such as
    a t-distribution. In this case, the output of the decoder would be
    three vectors: one for location, one for scale, and one for degrees
    of freedom, right?

Best Answer

Question 1:

The output of the decoder aims to model the distribution $p(x|t)$, i.e. the distribution of data $x$ given the latent variable $t$. Therefore, in principle, it should always be probabilistic.

However, in some cases, people simply use the mean squared error as the loss and, as you said, the output of the decoder is the actual predicted data points. Note that this approach can also be viewed as probabilistic, in the sense that it is equivalent to modeling $p(x|t)$ as a Gaussian with identity covariance, $p(x|t) = \mathcal{N}(x|\mu(t), I)$. In this case, the output of the decoder is the mean $\mu(t)$ and, therefore, for an example $x_i$, you get the following reconstruction loss:

\begin{align} -\log p(x_i \mid t_i) &= -\log \left(\frac{1}{\sqrt{(2\pi)^k |I|}} \exp \left(-\frac{1}{2}(x_i-\mu(t_i))^\intercal I^{-1} (x_i-\mu(t_i))\right)\right) \\ &= \frac{1}{2}\|x_i - \mu(t_i)\|^2 + \text{const.} \end{align}

which, as you can see, is half the squared reconstruction error plus a constant, i.e. proportional to the mean squared error.
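As a quick numerical check of this equivalence (my own sketch, not part of the derivation above): the Gaussian negative log-likelihood with identity covariance differs from half the squared error only by the constant $\frac{k}{2}\log(2\pi)$.

```python
import math
import torch
from torch.distributions import Normal, Independent

x = torch.randn(5, 10)   # five examples, k = 10 dimensions
mu = torch.randn(5, 10)  # stand-in for the decoder means mu(t_i)

nll = -Independent(Normal(mu, 1.0), 1).log_prob(x)  # per-example -log p(x|t)
half_sq_err = 0.5 * ((x - mu) ** 2).sum(dim=1)
const = 0.5 * x.shape[1] * math.log(2 * math.pi)

assert torch.allclose(nll, half_sq_err + const, atol=1e-4)
```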

Question 2:

A Bernoulli distribution makes sense for greyscale pixels.

This is not quite true. The correct statement would be that a Bernoulli distribution makes sense for black-and-white (i.e. binary) images. The Bernoulli distribution is binary, so it assumes that observations can only take two possible values. It is true that people sometimes use it for grayscale images, but this is an abuse of the model. It may work reasonably well for datasets that are almost black and white, like MNIST. However, a binarized version of the MNIST dataset exists and, strictly speaking, this is the version that should be used with a Bernoulli VAE.
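For concreteness, a hedged sketch of that "strictly speaking" setup: stochastically binarize the data, then score it under a Bernoulli parameterized by the decoder output (`probs` is an illustrative stand-in for `decoder(z)`):

```python
import torch
from torch.distributions import Bernoulli

x_gray = torch.rand(8, 784)                       # grayscale pixels in [0, 1)
x_bin = torch.bernoulli(x_gray)                   # one common binarization
probs = torch.rand(8, 784).clamp(1e-6, 1 - 1e-6)  # keep log_prob finite

# per-example Bernoulli negative log-likelihood of the binarized data
recon_nll = -Bernoulli(probs=probs).log_prob(x_bin).sum(dim=1)
```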

But for something like stock returns, we would want some other distribution, such as a t-distribution. In this case, the output of the decoder would be three vectors: one for location, one for scale, and one for degrees of freedom, right?

I would try a Gaussian first, $p(x|t) = \mathcal{N}(x|\mu(t), \sigma^2(t))$, so the decoder would output two vectors, $\mu(t)$ and $\sigma^2(t)$. But yeah, if you really want a t-distribution, then that is the way to go.
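As an illustrative sketch (class and layer names are my own assumptions, not a reference implementation), such decoder heads could look like this in PyTorch, with positivity of the scale and degrees of freedom enforced by exponentiating a log-parameter:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, StudentT

class GaussianHead(nn.Module):
    def __init__(self, hidden_dim, out_dim):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, out_dim)
        self.log_var = nn.Linear(hidden_dim, out_dim)

    def forward(self, h):
        # returns p(x|t) = N(mu(t), sigma^2(t))
        return Normal(self.mu(h), torch.exp(0.5 * self.log_var(h)))

class StudentTHead(nn.Module):
    def __init__(self, hidden_dim, out_dim):
        super().__init__()
        self.loc = nn.Linear(hidden_dim, out_dim)
        self.log_scale = nn.Linear(hidden_dim, out_dim)
        self.log_df = nn.Linear(hidden_dim, out_dim)

    def forward(self, h):
        # returns a t-distribution with location, scale, and df heads
        return StudentT(df=torch.exp(self.log_df(h)),
                        loc=self.loc(h),
                        scale=torch.exp(self.log_scale(h)))

# Either way, the reconstruction loss is the negative log-likelihood:
#   loss = -head(h).log_prob(x).sum(-1).mean()
```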
