Solved – Why does enforcing a prior distribution create semantic latent variables in variational autoencoders

autoencoders, generative-models

Variational autoencoders create latent variables that follow a known distribution (e.g., a Gaussian with zero mean and unit variance), and so do adversarial autoencoders. I understand why this turns the autoencoder into a generative model, but somehow it also causes the latent variables to take on semantic meaning. For example, when "walking the latent space" of an encoder trained on face images, one coordinate of the latent vector might gradually transition the generated images from male to female, another might gradually change the age, and yet another might control whether sunglasses appear. What is the theoretical explanation for this?

Best Answer

It doesn't. Not on its own. That property is a product of the interaction between the elements that comprise the model.

First, there's the assumption that each latent variable is (normally) distributed and that reconstruction happens by sampling from that distribution, which means the encoded values represent N real-valued features (N being the dimensionality of the latent space).

Leave that out altogether and, of course, you no longer have a VAE - just a bunch of arbitrary encodings that happen to represent the true values... somehow.

Reconstruction loss, of course, is what enforces that your encoded features are meaningful - without it, you're not really learning anything.
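To make those two ingredients concrete, here is a minimal sketch of the encode-sample-decode path and the reconstruction term. It uses PyTorch, and the layer shapes, variable names, and loss choice are illustrative assumptions rather than anything from the original post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions, chosen purely for illustration.
INPUT_DIM, LATENT_DIM = 784, 16

encoder = nn.Linear(INPUT_DIM, 2 * LATENT_DIM)   # outputs a mean and a log-variance per latent dim
decoder = nn.Linear(LATENT_DIM, INPUT_DIM)

def reconstruct(x):
    # Each latent variable is a real-valued feature with its own Gaussian.
    mu, logvar = encoder(x).chunk(2, dim=-1)
    # Reparameterization: z = mu + sigma * eps, so gradients flow through mu and sigma.
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_hat = torch.sigmoid(decoder(z))
    # The reconstruction loss is what forces the sampled features to carry real information.
    recon_loss = F.binary_cross_entropy(x_hat, x, reduction="sum")
    return recon_loss, mu, logvar

x = torch.rand(8, INPUT_DIM)   # a fake batch, just to run the sketch
recon_loss, mu, logvar = reconstruct(x)
```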


Second comes the constraint on the values of the distribution parameters - typically a KLD penalty between them and a zero-mean, unit-stddev Gaussian. It's effectively a form of centripetal force that pulls the representations together.
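For a diagonal Gaussian encoder and a standard-normal prior, that KLD term has a closed form. A sketch, assuming the mu/logvar tensors from the previous snippet:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    # It is zero exactly when mu = 0 and sigma = 1, and grows as encodings drift away.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# Sanity check of the "centripetal force": an encoding already at the prior costs nothing.
mu, logvar = torch.zeros(1, 16), torch.zeros(1, 16)
print(kl_to_standard_normal(mu, logvar))   # tensor(0.)

# The full objective would then be: total_loss = recon_loss + kl_to_standard_normal(mu, logvar)
```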

That's what gives you an interpolatable latent space. If you used it as the only loss, the optimizer would push all the encodings towards random means close to zero, arranging them into something like an N-dimensional bubble in the latent space.

Leave it out, on the other hand, and chances are the encodings will drift apart across the manifold. Why?


You're training the model on samples whose features vary in some way - for example, you're showing the model the same face at different rotations, or two identically positioned faces of two different people, or a combination of both...

This means the bottleneck representation of each face needs to somehow account for the variation in facial features across the training set, and wouldn't you know it - we've already set up the encoding to be perfect for that purpose.

The differences between samples that cannot be explained away as noise push the means apart so that distinct subclasses can be sampled - if the means are close together, the sampled feature shows up equally often in the decoded data for both classes, so it would be a waste of precious bottleneck neurons as far as the optimizer is concerned.


So, the semantic features come from trying to best represent the N most semantically salient variables in the training data (as extracted by the preceding encoder layers).

Encoding as mean+stddev ensures the features are interpolatable, and the KLD term limits the spread between feature encodings to minimize gaps in the interpolation.
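As a final illustration of "walking the latent space", one can decode points along the line between two latent codes. This is only a sketch under the same illustrative setup as above (an untrained stand-in decoder, made-up dimensions); with a real trained, KLD-regularized model, the intermediate points tend to decode to plausible in-between samples:

```python
import torch
import torch.nn as nn

LATENT_DIM, INPUT_DIM = 16, 784
decoder = nn.Linear(LATENT_DIM, INPUT_DIM)   # stands in for a trained decoder

def walk_latent(z_a, z_b, steps=8):
    # Decode evenly spaced points on the segment between two latent codes.
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * z_a + t * z_b
        frames.append(torch.sigmoid(decoder(z)))
    return torch.stack(frames)

# Hypothetical usage: interpolate between the encoded means of two inputs.
frames = walk_latent(torch.randn(LATENT_DIM), torch.randn(LATENT_DIM))
```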
