Variational Autoencoders – What Are Variational Autoencoders and Their Applications in Learning Tasks

autoencoders, bayesian, deep-learning, machine-learning, variational-bayes

As per this and this answer, autoencoders seem to be a technique that uses neural networks for dimension reduction. I would additionally like to know what a variational autoencoder is (its main differences/benefits compared to a "traditional" autoencoder), and what the main learning tasks these algorithms are used for.

Best Answer

Even though variational autoencoders (VAEs) are easy to implement and train, explaining them is not simple at all, because they blend concepts from Deep Learning and Variational Bayes, and the Deep Learning and Probabilistic Modeling communities use different terms for the same concepts. Thus, when explaining VAEs you risk either concentrating on the statistical model part, leaving the reader without a clue about how to actually implement it, or, vice versa, concentrating on the network architecture and loss function, in which case the Kullback-Leibler term seems to be pulled out of thin air. I'll try to strike a middle ground here, starting from the model but giving enough details to actually implement it in practice, or to understand someone else's implementation.

VAEs are generative models

Unlike classical (sparse, denoising, etc.) autoencoders, VAEs are generative models, like GANs. By generative model I mean a model which learns the probability distribution $p(\mathbf{x})$ over the input space $\mathcal{X}$. This means that after we have trained such a model, we can then sample from (our approximation of) $p(\mathbf{x})$. If our training set is made of handwritten digits (MNIST), then after training the generative model is able to create images which look like handwritten digits, even though they're not "copies" of the images in the training set.

Learning the distribution of the images in the training set implies that images which look like handwritten digits should have a high probability of being generated, while images which look like the Jolly Roger or random noise should have a low probability. In other words, it means learning about the dependencies among pixels: if our image is a $28\times 28=784$-pixel grayscale image from MNIST, the model should learn that if a pixel is very bright, then there's a significant probability that some neighboring pixels are bright too, that if we have a long, slanted line of bright pixels we may have another smaller, horizontal line of pixels above it (a 7), etc.

VAEs are latent variable models

The VAE is a latent variable model: this means that $\mathbf{x}$, the random vector of the 784 pixel intensities (the observed variables), is modeled as a (possibly very complicated) function of a random vector $\mathbf{z}\in\mathcal{Z}$ of lower dimensionality, whose components are unobserved (latent) variables. When does such a model make sense? For example, in the MNIST case we think that the handwritten digits belong to a manifold of dimension much smaller than the dimension of $\mathcal{X}$, because the vast majority of random arrangements of 784 pixel intensities don't look at all like handwritten digits. Intuitively we would expect the dimension to be at least 10 (the number of digits), but it's most likely larger because each digit can be written in different ways. Some differences are unimportant for the quality of the final image (for example, global rotations and translations), but others are important. So in this case the latent model makes sense. More on this later. Note that, amazingly, even if our intuition tells us that the dimension should be about 10, we can definitely use just 2 latent variables to encode the MNIST dataset with a VAE (though results won't be pretty). The reason is that even a single real variable can encode infinitely many classes, because it can assume infinitely many different values. Of course, if the classes have significant overlap among them (such as 9 and 8 or 7 and 1 in MNIST), even the most complicated function of just two latent variables will do a poor job of generating clearly discernible samples for each class. More on this later.

VAEs assume a multivariate parametric distribution $q(\mathbf{z}\vert\mathbf{x},\boldsymbol{\lambda})$ (where $\boldsymbol{\lambda}$ are the parameters of $q$), and they learn those parameters. The use of a parametric pdf for $\mathbf{z}$, which prevents the number of parameters of the VAE from growing without bound as the training set grows, is called amortization in VAE lingo (yeah, I know...).

The decoder network


We start from the decoder network because the VAE is a generative model, and the only part of the VAE which is actually used to generate new images is the decoder. The encoder network is only used at training time (it is the part that performs inference on the latent variables).

The goal of the decoder network is to generate new random vectors $\mathbf{x}$ belonging to the input space $\mathcal{X}$, i.e., new images, starting from realizations of the latent vector $\mathbf{z}$. Clearly, this means that it must learn the conditional distribution $p(\mathbf{x}\vert\mathbf{z})$. For VAEs this distribution is often assumed to be a multivariate Gaussian1:

$$p_{\boldsymbol{\phi}}(\mathbf{x}\vert\mathbf{z}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}(\mathbf{z}; \boldsymbol{\phi}), \boldsymbol{\sigma}(\mathbf{z}; \boldsymbol{\phi})^2I) $$

$\boldsymbol{\phi}$ is the vector of weights (and biases) of the decoder network. The vectors $\boldsymbol{\mu}(\mathbf{z};\boldsymbol{\phi})$ and $\boldsymbol{\sigma}(\mathbf{z}; \boldsymbol{\phi})$ are complex, unknown nonlinear functions, modeled by the decoder network: neural networks are powerful nonlinear function approximators.
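To make this concrete, here is a minimal sketch of such a decoder in PyTorch (my own illustrative choices: a single hidden layer, 784 MNIST pixels, 24 latent variables; nothing in the model prescribes these):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a latent vector z to the parameters of p(x | z) = N(mu(z), sigma(z)^2 I)."""
    def __init__(self, latent_dim=24, hidden_dim=400, x_dim=784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, x_dim)       # mu(z; phi)
        self.log_var = nn.Linear(hidden_dim, x_dim)  # log sigma^2(z; phi), log space keeps sigma > 0

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return self.mu(h), self.log_var(h)
```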


As noted by @amoeba in the comments, there is a striking similarity between the decoder and a classic latent variable model: Factor Analysis. In Factor Analysis, you assume the model:

$$ \mathbf{x}\vert\mathbf{z}\sim\mathcal{N}(\mathbf{W}\mathbf{z}+\boldsymbol{\mu}, \boldsymbol{\sigma}^2I),\ \mathbf{z}\sim\mathcal{N}(0,I)$$

Both models (FA and the decoder) assume that the conditional distribution of the observable variables $\mathbf{x}$ given the latent variables $\mathbf{z}$ is Gaussian, and that the $\mathbf{z}$ themselves are standard Gaussian. The difference is that the decoder doesn't assume that the mean of $p(\mathbf{x}|\mathbf{z})$ is linear in $\mathbf{z}$, nor does it assume that the standard deviation is a constant vector. On the contrary, it models them as complex nonlinear functions of $\mathbf{z}$. In this respect, it can be seen as nonlinear Factor Analysis. See here for an insightful discussion of this connection between FA and VAE. Since FA with an isotropic covariance matrix is just PPCA, this also ties in to the well-known result that a linear autoencoder reduces to PCA.
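To make the analogy concrete, Factor Analysis can be written as a "linear decoder" in the same style as the sketch above: the mean is the affine map $\mathbf{W}\mathbf{z}+\boldsymbol{\mu}$ and the (log-)variance is a constant vector which doesn't depend on $\mathbf{z}$. A hedged sketch, with the same illustrative conventions:

```python
import torch
import torch.nn as nn

class FactorAnalysisDecoder(nn.Module):
    """Factor Analysis as a 'linear decoder': mean = W z + mu, sigma constant in z."""
    def __init__(self, latent_dim=24, x_dim=784):
        super().__init__()
        self.linear = nn.Linear(latent_dim, x_dim)       # implements W z + mu
        self.log_var = nn.Parameter(torch.zeros(x_dim))  # log sigma^2, independent of z

    def forward(self, z):
        return self.linear(z), self.log_var.expand(z.shape[0], -1)
```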


Let's go back to the decoder: how do we learn $\boldsymbol{\phi}$? Intuitively, we want latent variables $\mathbf{z}$ which maximize the likelihood of generating the $\mathbf{x}_i$ in the training set $D_n$. In other words, we want to compute the posterior probability distribution of the $\mathbf{z}$, given the data:

$$p(\mathbf{z}\vert\mathbf{x})=\frac{p_{\boldsymbol{\phi}}(\mathbf{x}\vert\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$$

We assume a $\mathcal{N}(0,I)$ prior on $\mathbf{z}$, and we're left with the usual issue in Bayesian inference that computing $p(\mathbf{x})$ (the evidence) is hard (a multidimensional integral). What's more, since here $\boldsymbol{\mu}(\mathbf{z};\boldsymbol{\phi})$ is unknown, we can't compute it anyway. Enter Variational Inference, the tool which gives Variational Autoencoders their name.
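For concreteness, the evidence we cannot compute is the integral

$$p(\mathbf{x})=\int_{\mathcal{Z}} p_{\boldsymbol{\phi}}(\mathbf{x}\vert\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}$$

which has no closed form once $\boldsymbol{\mu}(\mathbf{z};\boldsymbol{\phi})$ and $\boldsymbol{\sigma}(\mathbf{z};\boldsymbol{\phi})$ are nonlinear neural networks, and whose dimensionality rules out numerical quadrature.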

Variational Inference for the VAE model

Variational Inference is a tool to perform approximate Bayesian Inference for very complex models. It's not an overly complex tool, but my answer is already too long and I won't go into a detailed explanation of VI. You can have a look at this answer and the references therein if you're curious:

https://stats.stackexchange.com/a/270569/58675

Suffice it to say that VI looks for an approximation to $p(\mathbf{z}\vert \mathbf{x})$ in a parametric family of distributions $q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})$, where, as noted above, $\boldsymbol{\lambda}$ are the parameters of the family. We look for the parameters which minimize the Kullback-Leibler divergence between $q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})$ and our target distribution $p(\mathbf{z}\vert \mathbf{x})$:

$$\min_{\boldsymbol{\lambda}}\mathcal{D}[q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})\vert\vert p(\mathbf{z}\vert \mathbf{x})]$$

Again, we cannot minimize this directly because the definition of the Kullback-Leibler divergence includes the evidence. Introducing the ELBO (Evidence Lower BOund) and after some algebraic manipulations, we finally arrive at:

$$ELBO(\boldsymbol{\lambda})= E_{q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})}[\log p(\mathbf{x}\vert\mathbf{z})]-\mathcal{D}[q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})\vert\vert p(\mathbf{z})]$$

Since the ELBO is a lower bound on the evidence (see the above link), maximizing the ELBO is not exactly equivalent to maximizing the likelihood of the data given $\boldsymbol{\lambda}$ (after all, VI is a tool for approximate Bayesian inference), but it goes in the right direction.
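The identity behind these manipulations (standard in VI, stated here without derivation) is

$$\log p(\mathbf{x}) = ELBO(\boldsymbol{\lambda}) + \mathcal{D}[q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})\vert\vert p(\mathbf{z}\vert \mathbf{x})]$$

Since the KL divergence is nonnegative, the ELBO is indeed a lower bound on the log-evidence, and maximizing it over $\boldsymbol{\lambda}$ is the same as minimizing the divergence between $q$ and the true posterior.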

In order to perform inference, we need to specify the parametric family $q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})$. In most VAEs we choose a multivariate, uncorrelated Gaussian distribution

$$q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda}) = \mathcal{N}(\mathbf{z}\vert\boldsymbol{\mu}(\mathbf{x}), \boldsymbol{\sigma}^2(\mathbf{x})I) $$

This is the same choice we made for $p(\mathbf{x}\vert\mathbf{z})$, though we may have chosen a different parametric family. As before, we can estimate these complex nonlinear functions by introducing a neural network model. Since this model accepts input images and returns parameters of the distribution of the latent variables, we call it the encoder network.
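For this choice of $q$ and the $\mathcal{N}(0,I)$ prior on $\mathbf{z}$, the KL term in the ELBO has a simple closed form, which is what implementations actually compute ($d$ is the number of latent variables):

$$\mathcal{D}[q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})\vert\vert p(\mathbf{z})] = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2(\mathbf{x}) + \sigma_j^2(\mathbf{x}) - \log\sigma_j^2(\mathbf{x}) - 1\right)$$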

The encoder network

Also called the inference network, this is only used at training time.


As noted above, the encoder must approximate $\boldsymbol{\mu}(\mathbf{x})$ and $\boldsymbol{\sigma}(\mathbf{x})$, thus if we have, say, $d=24$ latent variables, the output of the encoder is a $2d=48$-dimensional vector. The encoder has weights (and biases) $\boldsymbol{\theta}$. To learn $\boldsymbol{\theta}$, we can finally write the ELBO in terms of the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ of the encoder and decoder networks, as well as the training set points:
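A matching sketch of the encoder, with the same illustrative conventions as the decoder above (PyTorch, 24 latent variables, layer sizes of my own choosing):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an image x to the parameters of q(z | x) = N(mu(x), sigma(x)^2 I)."""
    def __init__(self, x_dim=784, hidden_dim=400, latent_dim=24):
        super().__init__()
        self.hidden = nn.Linear(x_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mu(x)
        self.log_var = nn.Linear(hidden_dim, latent_dim)  # log sigma^2(x)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        # two vectors of size 24, i.e. 2 * 24 = 48 numbers per image in total
        return self.mu(h), self.log_var(h)
```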

$$ELBO(\boldsymbol{\theta},\boldsymbol{\phi})= \sum_i E_{q_{\boldsymbol{\theta}}(\mathbf{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})}[\log p_{\boldsymbol{\phi}}(\mathbf{x}_i\vert\mathbf{z})]-\mathcal{D}[q_{\boldsymbol{\theta}}(\mathbf{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})\vert\vert p(\mathbf{z})]$$

We can finally conclude. The negative of the ELBO, as a function of $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, is used as the loss function of the VAE. We use SGD to minimize this loss, i.e., to maximize the ELBO. Since the ELBO is a lower bound on the evidence, this goes in the direction of maximizing the evidence, and thus of generating new images which are optimally similar to those in the training set. The first term in the ELBO is the expected log-likelihood of the training set points under the decoder, thus it encourages the decoder to produce images which are similar to the training ones. The second term can be interpreted as a regularizer: it encourages the encoder to generate a distribution for the latent variables which is similar to $p(\mathbf{z})=\mathcal{N}(0,I)$. But by introducing the probability model first, we understand where the whole expression comes from: the minimization of the Kullback-Leibler divergence between the approximate posterior $q_{\boldsymbol{\theta}}(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})$ and the model posterior $p(\mathbf{z}\vert \mathbf{x})$.2
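As an illustration, here is a hedged sketch of this loss for one minibatch, assuming a Gaussian decoder, the closed-form KL given earlier, and a single Monte Carlo sample for the expectation (see footnote 2); the function and variable names are mine:

```python
import torch

def negative_elbo(x, dec_mu, dec_log_var, enc_mu, enc_log_var):
    """Negative ELBO (the VAE loss) for a minibatch.

    dec_mu, dec_log_var: decoder outputs for a single z sampled from q(z | x)
    enc_mu, enc_log_var: encoder outputs, i.e. the parameters of q(z | x)
    """
    # Gaussian negative log-likelihood of x under p(x | z), up to an additive constant
    recon = 0.5 * (dec_log_var + (x - dec_mu) ** 2 / dec_log_var.exp()).sum(dim=1)
    # Closed-form KL between q(z | x) = N(mu, sigma^2 I) and the prior N(0, I)
    kl = 0.5 * (enc_mu ** 2 + enc_log_var.exp() - enc_log_var - 1).sum(dim=1)
    return (recon + kl).mean()  # average over the minibatch
```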

Once we have learned $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ by maximizing $ELBO(\boldsymbol{\theta},\boldsymbol{\phi})$, we can throw away the encoder. From now on, to generate new images just sample $\boldsymbol{z}\sim \mathcal{N}(0,I)$ and propagate it through the decoder. The decoder outputs will be images similar to those in the training set.
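Assuming the illustrative Decoder class sketched earlier, generation amounts to something like:

```python
import torch

decoder = Decoder()               # the decoder sketched above, assumed already trained
decoder.eval()
with torch.no_grad():
    z = torch.randn(16, 24)       # 16 draws from the N(0, I) prior, 24 latent dims
    mu, _ = decoder(z)            # we only need the mean of p(x | z)
    images = mu.view(-1, 28, 28)  # the mean is commonly used as the generated image
```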


1 This assumption is not strictly necessary, though it simplifies our description of VAEs. However, depending on the application, you may assume a different distribution for $p_{\boldsymbol{\phi}}(\mathbf{x}\vert\mathbf{z})$. For example, if $\mathbf{x}$ is a vector of binary variables, a Gaussian $p$ makes no sense, and a multivariate Bernoulli can be assumed.
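In that case the decoder outputs one logit per pixel and the reconstruction term of the ELBO becomes a binary cross-entropy; a sketch, assuming PyTorch and binarized pixels:

```python
import torch
import torch.nn.functional as F

def bernoulli_recon_nll(logits, x):
    """Negative log-likelihood of binary pixels x under a Bernoulli decoder.

    logits: decoder output, shape (batch, 784); x: binarized pixels, same shape.
    """
    return F.binary_cross_entropy_with_logits(logits, x, reduction='none').sum(dim=1)
```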

2 The ELBO expression, with its mathematical elegance, conceals two major sources of pain for VAE practitioners. One is the expectation term $E_{q_{\boldsymbol{\theta}}(\mathbf{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})}[\log p_{\boldsymbol{\phi}}(\mathbf{x}_i\vert\mathbf{z})]$. Estimating it requires drawing multiple samples from $q_{\boldsymbol{\theta}}(\mathbf{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})$. Given the sizes of the neural networks involved, and the slow convergence of the SGD algorithm, having to draw multiple random samples at each iteration (actually, for each data point in each minibatch, which is even worse) is very time-consuming. VAE users solve this problem very pragmatically by estimating that expectation with a single (!) random sample. The other issue is that to train the two neural networks (encoder and decoder) with the backpropagation algorithm, we need to be able to differentiate all the steps of the forward propagation from the encoder to the decoder. But the forward pass is not deterministic: the latent vector $\mathbf{z}$ fed to the decoder must be drawn from the multivariate Gaussian $q_{\boldsymbol{\theta}}(\mathbf{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})$ returned by the encoder, and a sampling step cannot be differentiated as it stands. The solution to this is the reparametrization trick.
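A minimal sketch of the trick (names are mine): instead of sampling $\mathbf{z}$ directly from $q_{\boldsymbol{\theta}}(\mathbf{z}\vert\mathbf{x},\boldsymbol{\lambda})$, we draw noise from $\mathcal{N}(0,I)$ and transform it deterministically, so gradients can flow back through $\boldsymbol{\mu}(\mathbf{x})$ and $\boldsymbol{\sigma}(\mathbf{x})$ into the encoder:

```python
import torch

def reparametrize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps, which does not depend on the encoder
    parameters, so backpropagation through mu and sigma is well defined.
    """
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```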
