Solved – Why aren’t auto-encoders also considered generative models

autoencoders, definition, generative-models, machine-learning, variational-bayes

Auto-encoders (AEs) are composed of an encoder and a decoder (often represented by neural networks). The encoder produces a vector representation $z$ of its input $x$ (e.g. an image). Given $z$, the decoder attempts to output a reconstruction $x'$ similar to $x$.
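The deterministic encode/decode pipeline above can be sketched in a few lines of numpy. The dimensions and the randomly initialised weight matrices are purely illustrative stand-ins for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: 8-dim input, 2-dim latent code.
x_dim, z_dim = 8, 2

# Random weights stand in for trained encoder/decoder parameters.
W_enc = rng.normal(size=(z_dim, x_dim))
W_dec = rng.normal(size=(x_dim, z_dim))

def encoder(x):
    """Map input x to a single deterministic code z (no distribution)."""
    return np.tanh(W_enc @ x)

def decoder(z):
    """Map code z back to a reconstruction x'."""
    return W_dec @ z

x = rng.normal(size=x_dim)
z = encoder(x)        # one fixed vector per input
x_prime = decoder(z)  # reconstruction of x
```

Note that for a given $x$ the code $z$ is always the same vector; there is nothing to sample from.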

Variational auto-encoders (VAEs) are auto-encoders that model both the hidden (or latent) space, which is the output space of the encoder, and the output space of the decoder as probability distributions (e.g. Gaussians). Hence, for each $x$, we can sample different $z$ from this hidden space and, similarly, we can sample different $x'$ from the output space of the decoder (given the sampled $z$). For this reason, VAEs are considered generative models.
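The contrast with the deterministic case can be sketched with the usual reparameterisation trick. This is a minimal illustration, not a trained model: the weights are random and the shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 8, 2

# Random weights stand in for trained parameters.
W_mu = rng.normal(size=(z_dim, x_dim))
W_logvar = rng.normal(size=(z_dim, x_dim))
W_dec = rng.normal(size=(x_dim, z_dim))

def encode(x):
    """VAE encoder: returns parameters of q(z|x), not a single point."""
    return W_mu @ x, W_logvar @ x

def sample_z(mu, logvar):
    """Reparameterisation: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    return W_dec @ z

x = rng.normal(size=x_dim)
mu, logvar = encode(x)
z1 = sample_z(mu, logvar)
z2 = sample_z(mu, logvar)  # a different draw for the same x

# To generate genuinely new data, skip the encoder: draw z from the
# prior p(z) = N(0, I) and decode it.
z_new = rng.standard_normal(z_dim)
x_new = decode(z_new)
```

The last two lines are what makes the model generative: sampling from the prior and decoding requires no input $x$ at all.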

The difference between AEs and VAEs is that VAEs are considered generative models, whereas standard AEs are not. Why is that the case? Is it only because they are usually not formalised as generative models? More precisely, why are VAEs generative models (according to the definition of a generative model) and AEs are not? For example, a denoising AE can attempt to generate an original denoised image, so, in this sense, it is a generative model. Could AEs be formulated as generative models?

Best Answer

My answer to this question would be the following:

A generative model, as defined on the Wikipedia link you provided, aims to estimate the joint distribution of your data and some latent (random) variables, usually $p(\textbf{x},\textbf{z})$. In particular, in the case of the VAE, the data (usually $\textbf{x}$) are the images, text, audio or whatever you are modeling, and the latent variable (usually $\textbf{z}$) is a multivariate normal (you can relax this). In the AE you cannot make this analogy: you have your data, you map it to a space of lower dimension than your original data, and you try to decode this lower-dimensional representation back into your original data. This means there are no distributional assumptions about how your data is generated. In probabilistic-reasoning lingo, there are no assumptions on the data generation process.

When I started studying VAEs I thought of them as "just" probabilistic AEs, but now I really don't like that way of looking at them. The intuition I have built around VAEs and the use of neural networks is the following: you build your model on a data generation process; in particular, you posit a latent variable per observation. Estimating each latent variable per observation can be extremely expensive in classical variational inference, so you use a function approximator (this is where the neural networks come in) and approximate the distribution of each latent variable using the observation itself. So neural networks appear in probabilistic reasoning because of their approximation capabilities, rather than VAEs being just probabilistic extensions of neural networks.

Similarly, other models have been developed around the same intuition I tried to explain: for example, deep Kalman filters, structured VAEs, etc.

EDIT: Note that my definition of generative model is a bit reductionist. There is a family of models called "auto-regressive" generative models that don't include a latent variable. In this case, you would be looking at the joint distribution of your variables as a factorization of the individual distributions, each conditional on the previous variables. Mathematically:

\begin{align} p(\textbf{x}) &= p(x_{0}, x_{1}, \ldots, x_{N}) \\ &= \prod_{i=0}^{N} p(x_{i} \mid \textbf{x}_{<i}) \\ &= p(x_{N} \mid x_{N-1}, x_{N-2}, \ldots, x_{0}) \, p(x_{N-1} \mid x_{N-2}, x_{N-3}, \ldots, x_{0}) \cdots p(x_{0}) \end{align}
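This chain-rule factorization can be checked numerically on a toy example. Assuming a hypothetical joint distribution over just two binary variables, the product of the marginal and the conditional recovers the joint exactly:

```python
import numpy as np

# Hypothetical joint distribution over two binary variables (x0, x1);
# rows index x0, columns index x1. Entries sum to 1.
joint = np.array([[0.1, 0.3],
                  [0.2, 0.4]])

p_x0 = joint.sum(axis=1)               # marginal p(x0)
p_x1_given_x0 = joint / p_x0[:, None]  # conditional p(x1 | x0)

# Chain rule: p(x0, x1) = p(x0) * p(x1 | x0) for every configuration.
reconstructed = p_x0[:, None] * p_x1_given_x0
assert np.allclose(reconstructed, joint)
```

Auto-regressive models (e.g. PixelCNN-style models) exploit exactly this decomposition, learning each conditional directly instead of introducing a latent $\textbf{z}$.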
