As per this and this answer, autoencoders seem to be a technique that uses neural networks for dimension reduction. I would like to additionally know what is a variational autoencoder (its main differences/benefits over a "traditional" autoencoders) and also what are the main learning tasks these algorithms are used for.
Variational Autoencoders – What Are Variational Autoencoders and Their Applications in Learning Tasks
autoencodersbayesiandeep learningmachine learningvariational-bayes
Related Solutions
Unsupervised, layer-wise pretraining was one of the early innovations that made it possible to use deep networks in practice. Since then, other tricks have been discovered that made layerwise pre-training unnecessary in many cases. Rectified linear units (ReLUs) are one example.
Glorot et al. (2011). Deep Sparse Rectifier Neural Networks.
Using deep autoencoders with ReLUs, they found that unsupervised pretraining was unnecessary and, in some cases, performance was better without it. But, they did find that unsupervised pretraining can help in a semi-supervised setting, when unlabeled data is available.
Optimization methods are another class of tricks. For example, Hessian-free (HF) optimization uses second order information to compute the update directions. The following paper found that HF optimization made unsupervised pre-training unnecessary for training deep autoencoders.
Martens (2010). Deep learning via Hessian-free optimization.
My impression is that unsupervised, layer-wise pretraining has generally fallen out of favor, except for specific circumstances (e.g. the semisupervised case). For related discussion, see here.
Before I attempt to answer your question I want to create a stronger separation between the methods you are referring to.
The first set of methods I believe you are referring to are neighborhood based dimensionality reduction methods, where a neighborhood graph is constructed where the edges represent a distance metric. Now to play devil's advocate against myself, MDS/ISOMAP can both be interpreted as a form of kernel PCA. So although this distinction seems relatively sharp, various interpretation shift these methods from one class to another.
The second set of methods you are referring to I would place in the field of unsupervised neural network learning. Autoencoders are a special architecture that attempts to map an input space into a lower-dimensional space that allows decoding back to the input space with minimal loss in information.
First, let's talk about benefits and drawbacks of autoencoders. Autoencoders are generally trained using some variant of stochastic gradient descent which yields some advantages. The dataset does not have to fit into memory, and can dynamically be loaded up and trained with gradient descent. Unlike a lot of methods in neighborhood-based learning which forces the dataset to exist in memory. The architecture of the autoencoders allows prior knowledge of the data to be incorporated into the model. For example, if are dataset contains images we can create an architecture that utilizes 2d convolution. If the dataset contains time-series that have long term connections, we can use gated recurrent networks (check out Seq2Seq learning). This is the power of neural networks in general. It allows us to encode prior knowledge about the problem into our models. This is something that other models, and to be more specific, dimensionality reduction algorithms cannot do.
From a theoretical perspective, there are a couple nice theorems. The deeper the network, the complexity of functions that are learnable by the network increases exponentially. In general, at least before something new is discovered, you are not going to find a more expressive/powerful model than a correctly selected neural network.
Now although all this sounds great, there are drawbacks. Convergence of neural networks is non-deterministic and depends heavily on the architecture used, the complexity of the problem, choice of hyper-parameters, etc. The expressiveness of neural networks causes problems too, they tend to overfit very quickly if the right regularization is not chosen/used.
On the other hand, neighborhood methods are less expressive and tend to run a deterministic amount of time until convergence based on much fewer parameters than neural networks.
The choice of method depends directly on the problem. If you have a small dataset that fits in memory and does not utilize any type of structured data (images, videos, audio) classical dimensionality reduction would probably be the way to go. But as structure is introduced, the complexity of your problem increases, and the amount of data you have grows neural networks become the correct choice.
Hope this helps.
Best Answer
Even though variational autoencoders (VAEs) are easy to implement and train, explaining them is not simple at all, because they blend concepts from Deep Learning and Variational Bayes, and the Deep Learning and Probabilistic Modeling communities use different terms for the same concepts. Thus when explaining VAEs you risk either concentrating on the statistical model part, leaving the reader without a clue about how to actually implement it, or vice versa to concentrate on the network architecture and loss function, in which the Kullback-Leibler term seems to be pulled out of thin air. I'll try to strike a middle ground here, starting from the model but giving enough details to actually implement it in practice, or understand someone's else implementation.
VAEs are generative models
Unlike classical (sparse, denoising, etc.) autoencoders, VAEs are generative models, like GANs. With generative model I mean a model which learns the probability distribution $p(\mathbf{x})$ over the input space $\mathcal{x}$. This means that after we have trained such a model, we can then sample from (our approximation of) $p(\mathbf{x})$. If our training set is made of handwritten digits (MNIST), then after training the generative model is able to create images which look like handwritten digits, even though they're not "copies" of the images in the training set.
Learning the distribution of the images in the training set implies that images which look like handwritten digits should have an high probability of being generated, while images which look like the Jolly Roger or random noise should have a low probability. In other words, it means learning about the dependencies among pixels: if our image is a $28\times 28=784$ pixels grayscale image from MNIST, the model should learn that if a pixel is very bright, then there's a significant probability that some neighboring pixels are bright too, that if we have a long, slanted line of bright pixels we may have another smaller, horizontal line of pixels above this one (a 7), etc.
VAEs are latent variable models
The VAE is a latent variables model: this means that $\mathbf{x}$, the random vector of the 784 pixel intensities (the observed variables), is modeled as a (possibly very complicated) function of a random vector $\mathbf{z}\in\mathcal{Z}$ of lower dimensionality, whose components are unobserved (latent) variables. When does such a model make sense? For example, in the MNIST case we think that the handwritten digits belong to a manifold of dimension much smaller than the dimension of $\mathcal{x}$, because the vast majority of random arrangements of 784 pixel intensities, don't look at all like handwritten digit. Intuitively we would expect the dimension to be at least 10 (the number of digits), but it's most likely larger because each digit can be written in different ways. Some differences are unimportant for the quality of the final image (for example, global rotations and translations), but others are important. So in this case the latent model makes sense. More on this later. Note that, amazingly, even if our intuition tells us that the dimension should about 10, we can definitely use just 2 latent variables to encode the MNIST dataset with a VAE (though results won't be pretty). The reason is that even a single real variable can encode infinitely many classes, because it can assume all possible integer values and more. Of course, if the classes have significant overlap among them (such as 9 and 8 or 7 and I in MNIST), even the most complicated function of just two latent variables will do a poor job of generating clearly discernible samples for each class. More on this later.
VAEs assume a multivariate parametric distribution $q(\mathbf{z}\vert\mathbf{x},\boldsymbol{\lambda})$ (where $\boldsymbol{\lambda}$ are the parameters of $q$), and they learn the parameters of the multivariate distribution. The use of a parametric pdf for $\mathbf{z}$, which prevents the number of parameters of a VAE to grow without bounds with the growth of the training set, is called amortization in VAE lingo (yeah, I know...).
The decoder network
We start from the decoder network because the VAE is a generative model, and the only part of the VAE which is actually used to generate new images is the decoder. The encoder network is only used at inference (training) time.
The goal of the decoder network is to generate new random vectors $\mathbf{x}$ belonging to the input space $\mathcal{X}$, i.e., new images, starting from realizations of the latent vector $\mathbf{z}$. This means clearly that it must learn the conditional distribution $p(\mathbf{x}\vert\mathbf{z})$. For VAEs this distribution is often assumed to be a multivariate Gaussian1:
$$p_{\boldsymbol{\phi}}(\mathbf{x}\vert\mathbf{z}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}(\mathbf{z}; \boldsymbol{\phi}), \boldsymbol{\sigma}(\mathbf{z}; \boldsymbol{\phi})^2I) $$
$\boldsymbol{\phi}$ is the vector of weights (and biases) of the encoder network. The vectors $\boldsymbol{\mu}(\mathbf{z};\boldsymbol{\phi})$ and $\boldsymbol{\sigma}(\mathbf{z}; \boldsymbol{\phi})$ are complex, unknown nonlinear functions, modeled by the decoder network: neural networks are powerful nonlinear functions approximators.
As noted by @amoeba in the comments, there is a striking similarity between the decoder and a classic latent variables model: Factor Analysis. In Factor Analysis, you assume the model:
$$ \mathbf{x}\vert\mathbf{z}\sim\mathcal{N}(\mathbf{W}\mathbf{z}+\boldsymbol{\mu}, \boldsymbol{\sigma}^2I),\ \mathbf{z}\sim\mathcal{N}(0,I)$$
Both models (FA & the decoder) assume that the conditional distribution of the observable variables $\mathbf{x}$ on the latent variables $\mathbf{z}$ is Gaussian, and that the $\mathbf{z}$ themselves are standard Gaussians. The difference is that the decoder doesn't assume that the mean of $p(\mathbf{x}|\mathbf{z})$ is linear in $\mathbf{z}$, nor it assumes that the standard deviation is a constant vector. On the contrary, it models them as complex nonlinear functions of the $\mathbf{z}$. In this respect, it can be seen as nonlinear Factor Analysis. See here for an insightful discussion of this connection between FA and VAE. Since FA with an isotropic covariance matrix is just PPCA, this also ties in to the well-known result that a linear autoencoder reduces to PCA.
Let's go back to the decoder: how do we learn $\boldsymbol{\phi}$? Intuitively we want latent variables $\mathbf{z}$ which maximize the likelihood of generating the $\mathbf{x}_i$ in the training set $D_n$. In other words we want to compute the posterior probability distribution of the $\mathbf{z}$, given the data:
$$p(\mathbf{z}\vert\mathbf{x})=\frac{p_{\boldsymbol{\phi}}(\mathbf{x}\vert\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$$
We assume a $\mathcal{N}(0,I)$ prior on $\mathbf{z}$, and we're left with the usual issue in Bayesian inference that computing $p(\mathbf{x})$ (the evidence) is hard (a multidimensional integral). What's more, since here $\boldsymbol{\mu}(\mathbf{z};\boldsymbol{\phi})$ is unknown, we can't compute it anyway. Enter Variational Inference, the tool which gives Variational Autoencoders their name.
Variational Inference for the VAE model
Variational Inference is a tool to perform approximate Bayesian Inference for very complex models. It's not an overly complex tool, but my answer is already too long and I won't go into a detailed explanation of VI. You can have a look at this answer and the references therein if you're curious:
https://stats.stackexchange.com/a/270569/58675
It suffices to say that VI looks for an approximation to $p(\mathbf{z}\vert \mathbf{x})$ in a parametric family of distributions $q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})$, where, as noted above, $\boldsymbol{\lambda}$ are the parameters of the family. We look for the parameters which minimize the Kullback-Leibler divergence between our target distribution $p(\mathbf{z}\vert \mathbf{x})$ and $q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})$:
$$\min_{\boldsymbol{\lambda}}\mathcal{D}[p(\mathbf{z}\vert \mathbf{x})\vert\vert q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda})]$$
Again, we cannot minimize this directly because the definition of Kullback-Leibler divergence includes the evidence. Introducing the ELBO (Evidence Lower BOund) and after some algebraic manipulations, we finally get at:
$$ELBO(\boldsymbol{\lambda})= E_{q(\boldsymbol{z}\vert \mathbf{x},\boldsymbol{\lambda})}[\log p(\mathbf{x}\vert\boldsymbol{z})]-\mathcal{D}[(q(\boldsymbol{z}\vert \mathbf{x},\boldsymbol{\lambda})\vert\vert p(\boldsymbol{z})]$$
Since the ELBO is a lower bound on evidence (see the above link), maximizing the ELBO is not exactly equivalent to maximizing the likelihood of data given $\boldsymbol{\lambda}$ (after all, VI is a tool for approximate Bayesian inference), but it goes in the right direction.
In order to make inference, we need to specify the parametric family $q(\boldsymbol{z}\vert \mathbf{x},\boldsymbol{\lambda})$. In most VAEs we choose a multivariate, uncorrelated Gaussian distribution
$$q(\mathbf{z}\vert \mathbf{x},\boldsymbol{\lambda}) = \mathcal{N}(\mathbf{z}\vert\boldsymbol{\mu}(\mathbf{x}), \boldsymbol{\sigma}^2(\mathbf{x})I) $$
This is the same choice we made for $p(\mathbf{x}\vert\mathbf{z})$, though we may have chosen a different parametric family. As before, we can estimate these complex nonlinear functions by introducing a neural network model. Since this model accepts input images and returns parameters of the distribution of the latent variables we call it the encoder network. As before, we can estimate these complex nonlinear functions by introducing a neural network model. Since this model accepts input images and returns parameters of the distribution of the latent variables we call it the encoder network.
The encoder network
Also called the inference network, this is only used at training time.
As noted above, the encoder must approximate $\boldsymbol{\mu}(\mathbf{x})$ and $\boldsymbol{\sigma}(\mathbf{x})$, thus if we have, say, 24 latent variables, the output of the encoder is a $d=48$ vector. The encoder has weights (and biases) $\boldsymbol{\theta}$. To learn $\boldsymbol{\theta}$, we can finally write the ELBO in terms of the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ of the encoder and decoder network, as well as the training set points:
$$ELBO(\boldsymbol{\theta},\boldsymbol{\phi})= \sum_i E_{q_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})}[\log p_{\boldsymbol{\phi}}(\mathbf{x}_i\vert\boldsymbol{z})]-\mathcal{D}[(q_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})\vert\vert p(\boldsymbol{z})]$$
We can finally conclude. The opposite of the ELBO, as a function of $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, is used as the loss function of the VAE. We use SGD to minimize this loss, i.e., maximize the ELBO. Since the ELBO is a lower bound on the evidence, this goes in the direction of maximizing the evidence, and thus generating new images which are optimally similar to those in the training set. The first term in the ELBO is the expected negative log-likelihood of the training set points, thus it encourages the decoder to produce images which are similar to the training ones. The second term can be interpreted as a regularizer: it encourages the encoder to generate a distribution for the latent variables which is similar to $p(\boldsymbol{z})=\mathcal{N}(0,I)$. But by introducing the probability model first, we understood where the whole expression comes from: the minimization of the Kullabck-Leibler divergence between the approximate posterior $q_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \mathbf{x},\boldsymbol{\lambda})$ and the model posterior $p(\boldsymbol{z}\vert \mathbf{x},\boldsymbol{\lambda})$.2
Once we have learned $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ by maximizing $ELBO(\boldsymbol{\theta},\boldsymbol{\phi})$, we can throw away the encoder. From now on, to generate new images just sample $\boldsymbol{z}\sim \mathcal{N}(0,I)$ and propagate it through the decoder. The decoder outputs will be images similar to those in the training set.
References and further reading
1 This assumption is not strictly necessary, though it simplifies our description of VAEs. However, depending on applications, you may assume a different distribution for $p_{\phi}(\mathbf{x}\vert\mathbf{z})$. For example, if $\mathbf{x}$ is a vector of binary variables, a Gaussian $p$ makes no sense, and a multivariate Bernoulli can be assumed.
2 The ELBO expression, with its mathematical elegance, conceals two major sources of pain for the VAE practitioners. One is the average term $E_{q_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})}[\log p_{\boldsymbol{\phi}}(\mathbf{x}_i\vert\boldsymbol{z})]$. This effectively requires computing an expectation, which requires taking multiple samples from $q_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \mathbf{x}_i,\boldsymbol{\lambda})$. Given the sizes of the involved neural networks, and the low convergence rate of the SGD algorithm, having to draw multiple random samples at each iteration (actually, for each minibatch, which is even worse) is very time-consuming. VAE users solve this problem very pragmatically by computing that expectation with a single (!) random sample. The other issue is that to train two neural networks (encoder & decoder) with the backpropagation algorithm, I need to be able to differentiate all steps involved in forward propagation from the encoder to the decoder. Since the decoder is not deterministic (evaluating its output requires drawing from a multivariate Gaussian), it doesn't even make sense to ask if it's a differentiable architecture. The solution to this is the reparametrization trick.