Autoencoders – Loss Function in Autoencoder vs Variational Autoencoder or MSE Loss vs Binary Cross Entropy Loss

autoencoders, loss-functions, neural-networks, tensorflow, variational-bayes

When the label vector has real-valued entries (e.g. floats between 0 and 1 as a normalized representation of greyscale values from 0 to 255), I always thought that we use MSE (R2-loss) if we want to measure the distance/error between the input and the output, or in general between the input and the label of the network.
On the other hand, I also always thought that binary cross-entropy is only used when we try to predict probabilities and the ground-truth label entries are actual probabilities.

Now, when working with the MNIST dataset loaded via TensorFlow like so:

# TF 1.x tutorial helper for downloading and loading MNIST
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Each entry is a float32 and ranges between 0 and 1.
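A quick check of the loaded arrays (my own, not part of the tutorial) confirms this:

# mnist.train.images is a float32 array of flattened 28x28 images, rescaled to [0, 1]
print(mnist.train.images.dtype)                              # float32
print(mnist.train.images.min(), mnist.train.images.max())    # 0.0 1.0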

The TensorFlow tutorial for the autoencoder uses R2-loss/MSE-loss to measure the reconstruction loss.

Whereas the TensorFlow tutorial for the variational autoencoder uses binary cross-entropy to measure the reconstruction loss.

Can someone please tell me WHY, based on the same dataset with the same values (they are all numerical values which in effect represent pixel intensities), they use R2-loss/MSE-loss for the autoencoder and binary cross-entropy loss for the variational autoencoder?

I think it is needless to say that both loss functions are applied to sigmoid outputs.
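To be concrete, the setup I have in mind looks roughly like this (TF 1.x style; this is my own sketch with a single dense layer standing in for the decoder's last layer, not the exact tutorial code):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])      # flattened MNIST images in [0, 1]
logits = tf.layers.dense(x, 784)                 # stand-in for the decoder's final layer
x_hat = tf.nn.sigmoid(logits)                    # reconstructions, also in [0, 1]

# Autoencoder tutorial style: mean squared error between input and reconstruction.
mse_loss = tf.reduce_mean(tf.square(x - x_hat))

# VAE tutorial style: per-pixel binary cross-entropy, treating intensities as "probabilities".
bce_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=logits))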

Best Answer

I don't believe there's some kind of deep, meaningful rationale at play here - it's a showcase example running on MNIST, which is pretty error-tolerant.


Optimizing for MSE means the generated output intensities stay symmetrically close to the input intensities: an intensity that is higher than the target is penalized by exactly the same amount as an intensity that is lower by the same margin.
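A one-line sanity check (my own numbers, plain Python):

# The squared-error penalty is symmetric around the target intensity.
target = 0.8
for pred in (0.7, 0.9):
    print(pred, round((pred - target) ** 2, 4))   # both give 0.01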


Cross-entropy loss is asymmetrical.

If your true intensity is high, e.g. 0.8, generating a pixel with an intensity of 0.9 is penalized more than generating a pixel with an intensity of 0.7.

Conversely, if it's low, e.g. 0.3, predicting an intensity of 0.4 is penalized less than predicting an intensity of 0.2.
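You can verify those numbers directly (my own quick check in plain Python/NumPy):

import numpy as np

def bce(t, p):
    # Per-pixel binary cross-entropy for target intensity t and predicted intensity p.
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Target 0.8: overshooting to 0.9 costs more (above the minimum) than undershooting to 0.7.
print(bce(0.8, 0.9) - bce(0.8, 0.8))   # ~0.044
print(bce(0.8, 0.7) - bce(0.8, 0.8))   # ~0.026

# Target 0.3: moving towards 0.5 (prediction 0.4) costs less than moving away (prediction 0.2).
print(bce(0.3, 0.4) - bce(0.3, 0.3))   # ~0.022
print(bce(0.3, 0.2) - bce(0.3, 0.3))   # ~0.028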

You might have guessed by now - cross-entropy loss is biased towards 0.5 whenever the ground truth is not binary. For a ground truth of 0.5, the per-pixel zero-normalized loss (the loss minus its minimum value) is approximately equal to 2*MSE near the optimum.
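A quick check of that claim (my own numbers; the approximation is a second-order Taylor expansion around the target, so it only holds for predictions near 0.5):

import numpy as np

t = 0.5
p = np.linspace(0.35, 0.65, 7)                    # predictions near the 0.5 target
bce = -(t * np.log(p) + (1 - t) * np.log(1 - p))
zero_normalized = bce - np.log(2)                 # subtract the minimum, attained at p = 0.5
print(np.round(zero_normalized, 4))               # ~[0.047 0.020 0.005 0.    0.005 0.020 0.047]
print(np.round(2 * (p - t) ** 2, 4))              # ~[0.045 0.020 0.005 0.    0.005 0.020 0.045]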

This is quite obviously wrong! The end result is that you're training the network to always generate images that are blurrier than the inputs. You're actively penalizing any result that would enhance the output sharpness more than those that make it worse!


MSE is not immune to this behavior either, but at least it's merely unbiased, and not biased in the completely wrong direction.

However, before you run off to write a loss function with the opposite bias - just keep in mind that pushing outputs away from 0.5 will in turn mean the decoded images have very hard, pixelated edges.

That is - or at least I very strongly suspect it is - why adversarial methods yield better results: the adversarial component is essentially a trainable, 'smart' loss function for the (possibly variational) autoencoder.