I'm working on an implementation of a Variational Autoencoder (VAE). There are lots of helpful examples and guides out there, which typically introduce VAEs in the context of image data, e.g. MNIST. Since pixels (input features) are scaled between zero and one – $x\in[0,1]$ – these examples use a sigmoid activation in the last layer of the decoder $d: z \mapsto x$. That makes sense, but what if I cannot assume any knowledge about the scale of $x$? Do I just not use any activation at all? And if so, does that make learning significantly harder?
Any help would be much appreciated!
Best Answer
Your output activation function (in combination with your loss function) corresponds to (implicit) assumptions you make about the distribution of the data:

- A sigmoid output together with a binary cross-entropy loss models each feature as Bernoulli-distributed, which is why it is the standard choice for data scaled to $[0,1]$ such as MNIST pixels.
- A linear output (no activation) together with a mean-squared-error loss models each feature as Gaussian with fixed variance, which is the natural choice when you cannot bound the scale of $x$.

So if you know nothing about the scale of your data, dropping the final activation and training with MSE is not a hack but the principled choice. Learning does not become fundamentally harder; in practice it helps to standardize the inputs first.
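To see the correspondence concretely: sigmoid + binary cross-entropy is exactly the Bernoulli negative log-likelihood, and a linear output + MSE is the Gaussian negative log-likelihood up to an additive constant. A quick numerical check (the `logit` and `x` values are made up for illustration):

```python
import numpy as np

# Hypothetical toy values: a decoder pre-activation and a target in [0, 1].
logit, x = 0.3, 0.8

# Sigmoid output + binary cross-entropy == Bernoulli negative log-likelihood.
p = 1.0 / (1.0 + np.exp(-logit))          # sigmoid activation
bce = -(x * np.log(p) + (1 - x) * np.log(1 - p))
bernoulli_nll = -np.log(p**x * (1 - p) ** (1 - x))
assert np.isclose(bce, bernoulli_nll)

# Linear (identity) output + MSE == Gaussian NLL with unit variance,
# up to an additive constant that does not depend on the prediction.
mu = logit                                 # no activation: the output is the mean
mse = 0.5 * (x - mu) ** 2
gauss_nll = 0.5 * np.log(2 * np.pi) + 0.5 * (x - mu) ** 2
assert np.isclose(gauss_nll - mse, 0.5 * np.log(2 * np.pi))
```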
You can easily see the difference (also for plain AEs) by running the following experiment (on a simple dataset like MNIST): train two otherwise identical autoencoders, one with a sigmoid output and binary cross-entropy loss and one with a linear output and MSE loss, and compare the reconstructions.
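A minimal sketch of such an experiment, using a tiny numpy autoencoder and synthetic data in $[0,1]$ as a stand-in for MNIST (the helper names and hyperparameters are illustrative assumptions, not anything specific to VAEs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: 256 samples of 16-dim data scaled to [0, 1].
X = rng.uniform(0.0, 1.0, size=(256, 16))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_ae(X, output="sigmoid", hidden=8, lr=0.1, epochs=200):
    """One-hidden-layer autoencoder trained by plain gradient descent.

    output="sigmoid": sigmoid output layer + binary cross-entropy loss.
    output="linear":  no output activation + mean squared error loss.
    """
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # encoder
        A = H @ W2 + b2                     # decoder pre-activation
        if output == "sigmoid":
            Y = sigmoid(A)
            dA = (Y - X) / n                # grad of mean BCE w.r.t. A (sigmoid+BCE identity)
        else:
            Y = A
            dA = 2.0 * (Y - X) / n          # grad of mean squared error w.r.t. A
        dW2 = H.T @ dA; db2 = dA.sum(axis=0)
        dZ1 = (dA @ W2.T) * (1.0 - H**2)    # backprop through tanh
        dW1 = X.T @ dZ1; db1 = dZ1.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    # Compare both variants on the same footing: reconstruction MSE.
    Y = np.tanh(X @ W1 + b1) @ W2 + b2
    if output == "sigmoid":
        Y = sigmoid(Y)
    return np.mean((Y - X) ** 2)

err_sigmoid = train_ae(X, output="sigmoid")
err_linear = train_ae(X, output="linear")
print(f"sigmoid+BCE MSE: {err_sigmoid:.4f}, linear+MSE MSE: {err_linear:.4f}")
```

With data in $[0,1]$ both variants reconstruct reasonably; if you rescale `X` outside $[0,1]$, the sigmoid variant cannot reach the targets at all, while the linear+MSE variant is unaffected.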
Obviously this only covers a few common cases. If your data follows some exotic distribution, there may be no well-known loss function and/or activation function that matches it. Still, it should give you a basis for making informed decisions about which functions to use to achieve good reconstructions.