I don't believe there's some deep, meaningful rationale at play here: it's a showcase example running on MNIST, which is pretty error-tolerant.
Optimizing for MSE means your generated output intensities are symmetrically close to the input intensities: a higher-than-target intensity is penalized by the same amount as an equally distant lower one.
Cross-entropy loss is asymmetrical.
If the true intensity is high, e.g. 0.8, generating a pixel with intensity 0.9 is penalized more than generating one with intensity 0.7.
Conversely, if it's low, e.g. 0.3, predicting an intensity of 0.4 is penalized less than predicting 0.2.
You might have guessed by now: cross-entropy loss is biased towards 0.5 whenever the ground truth is not binary. For a ground truth of 0.5, the per-pixel loss, normalized so that its minimum is zero, is approximately 2*MSE near that minimum.
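These numbers are easy to verify. Here's a quick sanity check, assuming the standard per-pixel binary cross-entropy formula $-(t \log p + (1-t)\log(1-p))$ (my assumption about the loss being used, since the example's code isn't shown):

```python
import math

def bce(t, p):
    # Per-pixel binary cross-entropy for target intensity t, prediction p.
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

# Ground truth 0.8: overshooting is penalized more than undershooting.
print(bce(0.8, 0.9), bce(0.8, 0.7))  # ~0.545 vs ~0.526

# Ground truth 0.3: undershooting is penalized more than overshooting.
print(bce(0.3, 0.2), bce(0.3, 0.4))  # ~0.639 vs ~0.632

# Ground truth 0.5: zero-normalized BCE vs. 2 * MSE, close near the minimum.
for p in (0.55, 0.6, 0.7):
    print(bce(0.5, p) - bce(0.5, 0.5), 2 * (0.5 - p) ** 2)
# ~0.0050 vs 0.0050, ~0.0204 vs 0.0200, ~0.0872 vs 0.0800
```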
This bias is quite obviously wrong! The end result is that you're training the network to generate images that are always blurrier than the inputs: you're actively penalizing results that would sharpen the output more heavily than those that blur it!
MSE is not immune to this behavior either, but at least it's unbiased rather than biased in exactly the wrong direction.
However, before you run off to write a loss function with the opposite bias, keep in mind that pushing outputs away from 0.5 will in turn give the decoded images very hard, pixelated edges.
That, I very strongly suspect, is why adversarial methods yield better results: the adversarial component is essentially a trainable, 'smart' loss function for the (possibly variational) autoencoder.
The documentation says that this loss function is computed using the log loss of the softmax of $x$ (`output` in your code). For your example, we have
$$
\begin{align}
-\log\left(\frac{\exp(x_j)}{\sum_i \exp (x_i)}\right)&= -x_j+\log\left(\sum_i \exp(x_i)\right) \\
&= -1 + \log\left( \exp(0) + \exp(1) + \exp(0) \right) \\
&= 0.5514.
\end{align}
$$
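If the library in question is PyTorch (an assumption on my part; its `CrossEntropyLoss` documentation matches the formula above), the number is easy to reproduce:

```python
import torch
import torch.nn.functional as F

# Raw scores (logits) [0, 1, 0], with class index 1 as the target.
logits = torch.tensor([[0.0, 1.0, 0.0]])
target = torch.tensor([1])

print(F.cross_entropy(logits, target).item())  # ~0.5514
```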
To achieve the desired result, you could either have your network output scores as described in the documentation, or else use a loss function that works directly with probabilities.
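Still assuming PyTorch, the second option could mean taking the log of the output probabilities and using `nll_loss` (the probabilities below are just illustrative values):

```python
import torch
import torch.nn.functional as F

# If the network already outputs probabilities, take their log and
# apply the negative log-likelihood loss directly.
probs = torch.tensor([[0.2, 0.6, 0.2]])
target = torch.tensor([1])

print(F.nll_loss(torch.log(probs), target).item())  # -log(0.6) ~ 0.5108
```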
Best Answer
Cross entropy is defined on probability distributions, not single values. The reason it works for classification is that classifier output is (often) a probability distribution over class labels. For example, the outputs of logistic/softmax functions are interpreted as probabilities. The observed class label is also treated as a probability distribution: the empirical distribution (where the probability is 1 for the observed class and 0 for the others).
The concept of cross entropy applies equally well to continuous distributions. But it can't be used with regression models that output a point estimate (e.g. the conditional mean) rather than a full probability distribution. If you had a model that gave the full conditional distribution (the probability of the output given the input), you could use cross entropy as a loss function.
For continuous distributions $p$ and $q$, the cross entropy is defined as: $$H(p, q) = -\int_{Y} p(y) \log q(y) dy$$
Just considering a single observed input/output pair $(x, y)$, $p$ would be the empirical conditional distribution (a delta function over the observed output value), and $q$ would be the modeled conditional distribution (probability of output given input). In this case, the cross entropy reduces to $-\log q(y \mid x)$. Summing over data points, this is just the negative log likelihood!
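As a minimal sketch of that last point, assuming a Gaussian conditional $q(y \mid x)$ whose mean and log-variance the network predicts (the helper below is illustrative, not from any particular library):

```python
import math
import torch

def gaussian_nll(mean, log_var, y):
    # -log q(y | x) for q(. | x) = Normal(mean, exp(log_var)).
    # Averaged over data points, this is the cross entropy between the
    # empirical delta at y and the modeled conditional distribution q.
    return 0.5 * (math.log(2 * math.pi)
                  + log_var
                  + (y - mean) ** 2 / log_var.exp()).mean()
```

Note that with the variance held fixed, this reduces to MSE up to additive and multiplicative constants, which is one way to see MSE as the cross-entropy loss under a Gaussian noise assumption.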