Variational Autoencoders – What Activation Function to Use for the Last Layer of Decoder?

autoencoders, neural-networks, variational-inference, variational-bayes

I'm working on an implementation of a Variational Autoencoder (VAE). There are lots of helpful examples and guides out there, which typically introduce VAE in the context of image data, e.g. MNIST. Since pixels (input features) are scaled between zero and one – $x\in[0,1]$ – these examples use sigmoid activation in the last layer of the decoder $d: z \mapsto x$. That makes sense, but what if I cannot assume any knowledge about the scale of $x$? Do I just not use any activation at all? And if so, does that make learning significantly harder?
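For concreteness, here is roughly what I mean, as a minimal PyTorch-style sketch (my assumed framework; the layer sizes 20 and 784 are just placeholders):

```python
import torch.nn as nn

# Pixels scaled to [0, 1], as in the MNIST tutorials: sigmoid output.
decoder_mnist = nn.Sequential(nn.Linear(20, 784), nn.Sigmoid())

# Unknown scale of x: simply no output activation at all?
decoder_unbounded = nn.Linear(20, 784)
```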

Any help would be much appreciated!

Best Answer

Your output activation function (in combination with your loss function) corresponds to an (implicit) assumption you make about the distribution of the data (see the sketch after the list):

  • logistic sigmoid + binary cross-entropy = Bernoulli assumption
  • identity function + mean squared error = Gaussian assumption
  • softmax function + cross-entropy = Categorical assumption
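To make these pairings concrete, here is a minimal sketch of how they might look in PyTorch (an assumed library choice; the 20-dimensional latent code and 784 outputs are placeholder sizes, not from the question):

```python
import torch
import torch.nn as nn

latent_dim, out_dim = 20, 784
z = torch.randn(16, latent_dim)              # a batch of latent codes
last_layer = nn.Linear(latent_dim, out_dim)  # final decoder layer (pre-activation)
logits = last_layer(z)

# Bernoulli assumption: logistic sigmoid + binary cross-entropy
x_bernoulli = torch.sigmoid(logits)          # outputs in (0, 1)
bce = nn.BCELoss()                           # expects targets in [0, 1]

# Gaussian assumption: identity + mean squared error
x_gaussian = logits                          # no output activation
mse = nn.MSELoss()                           # targets can have any scale

# Categorical assumption: softmax + cross-entropy
x_categorical = torch.softmax(logits, dim=-1)
# (in PyTorch, nn.CrossEntropyLoss takes the raw logits and applies
#  log-softmax internally, so you would not apply softmax before it)
```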

You can easily see the difference (also for plain AEs) by running the following experiment (on a simple dataset like MNIST):

  1. train an AE on images with pixel intensities in [0, 1] using logistic sigmoid + BCE;
  2. train an AE on images with normalised pixel intensities (subtract the mean and divide by the standard deviation) using the identity function + MSE;
  3. compare the reconstructions from the two models. The different assumptions should be clearly visible. (A rough code sketch of this experiment follows below.)
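A rough sketch of this experiment, assuming PyTorch and torchvision for MNIST (the architecture sizes, optimiser, and epoch count are arbitrary choices, not prescribed by the answer):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_ae(out_activation=None):
    """Small fully connected autoencoder; append the given output activation, if any."""
    layers = [nn.Flatten(),
              nn.Linear(784, 128), nn.ReLU(),
              nn.Linear(128, 32), nn.ReLU(),    # bottleneck
              nn.Linear(32, 128), nn.ReLU(),
              nn.Linear(128, 784)]
    if out_activation is not None:
        layers.append(out_activation)
    return nn.Sequential(*layers)

def train(model, loader, loss_fn, epochs=5):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, _ in loader:
            target = x.view(x.size(0), -1)      # flatten images to match the output
            loss = loss_fn(model(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# 1. pixel intensities in [0, 1]: logistic sigmoid + BCE
data_01 = datasets.MNIST(".", train=True, download=True,
                         transform=transforms.ToTensor())
ae_bce = train(make_ae(nn.Sigmoid()),
               DataLoader(data_01, batch_size=128, shuffle=True),
               nn.BCELoss())

# 2. normalised pixel intensities: identity + MSE
# (0.1307 / 0.3081 are the commonly quoted MNIST mean and standard deviation)
normalise = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
data_std = datasets.MNIST(".", train=True, download=True, transform=normalise)
ae_mse = train(make_ae(None),
               DataLoader(data_std, batch_size=128, shuffle=True),
               nn.MSELoss())

# 3. feed the same test images through both models and compare the reconstructions
```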

Obviously, this only covers a few common cases. If your data follows some exotic distribution, there might not be a fixed/well-known loss function and/or activation function to pair with it. However, it should provide a basis for making informed decisions about which functions to use to achieve good reconstructions.