As shown in the figure, the encoder produces $\mu$ and $\sigma$, which are the mean and standard deviation of the posterior distribution $q(z|x)$.
The random effect comes from drawing samples from the posterior distribution. Each sample $z$ can be obtained by $z = \mu+\sigma\epsilon$, where $\epsilon$ is a sample from a Gaussian distribution with zero mean and unit variance, which can be easily obtained.
The reason why we need the randomness is that the reconstruction error term in the loss function is actually an expectation over the posterior distribution, which is difficult to compute analytically, so we use the sample mean to approximate it.
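As a sketch of the reparameterization step above (assuming a NumPy setting; the encoder outputs `mu` and `log_var` below are made-up values, and outputting the log-variance rather than $\sigma$ directly is a common but not universal choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for a single input x (made-up values)
mu = np.array([0.5, -1.0])      # posterior mean
log_var = np.array([0.0, 0.2])  # encoders often output log-variance for numerical stability
sigma = np.exp(0.5 * log_var)   # standard deviation

# Draw eps ~ N(0, I); then z = mu + sigma * eps is a sample from q(z|x)
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps

# Averaging over many such samples approximates the expectation in the loss
samples = mu + sigma * rng.standard_normal((10000, mu.size))
print(samples.mean(axis=0))  # close to mu
```

Because the randomness is isolated in $\epsilon$, gradients can flow through $\mu$ and $\sigma$ during training.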
For the derivation of the regularization term, you can refer to the original paper: https://arxiv.org/abs/1312.6114
I don't believe there's some kind of deep, meaningful rationale at play here - it's a showcase example running on MNIST, it's pretty error-tolerant.
Optimizing for MSE means your generated output intensities are symmetrically close to the input intensities. A higher-than-training intensity is penalized by the same amount as an equally valued lower intensity.
Cross-entropy loss is asymmetrical.
If your true intensity is high, e.g. 0.8, generating a pixel with the intensity of 0.9 is penalized more than generating a pixel with intensity of 0.7.
Conversely if it's low, e.g. 0.3, predicting an intensity of 0.4 is penalized less than a predicted intensity of 0.2.
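These numbers can be checked directly. A minimal sketch, using the example intensities above (`bce` is a hypothetical per-pixel cross-entropy helper, not from any library):

```python
import math

def bce(target: float, pred: float) -> float:
    """Per-pixel binary cross-entropy, evaluated with a non-binary target."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

def mse(target: float, pred: float) -> float:
    return (target - pred) ** 2

# MSE penalizes over- and undershooting symmetrically...
print(mse(0.8, 0.9), mse(0.8, 0.7))  # equal up to float rounding

# ...but cross-entropy does not:
print(bce(0.8, 0.9) > bce(0.8, 0.7))  # True: overshooting a high target costs more
print(bce(0.3, 0.4) < bce(0.3, 0.2))  # True: moving toward 0.5 costs less
```

In both cross-entropy cases the prediction closer to 0.5 receives the smaller penalty, which is exactly the asymmetry described above.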
You might have guessed by now - cross-entropy loss is biased towards 0.5 whenever the ground truth is not binary. For a ground truth of 0.5, the per-pixel zero-normalized loss is approximately 2*MSE.
This is quite obviously wrong! The end result is that you're training the network to always generate images that are blurrier than the inputs: any result that would sharpen the output is penalized more heavily than one that blurs it by the same amount.
MSE is not immune to this behavior either, but at least it's unbiased rather than biased in exactly the wrong direction.
However, before you run off to write a loss function with the opposite bias - keep in mind that pushing outputs away from 0.5 will in turn give the decoded images very hard, pixelated edges.
That is - or at least I very strongly suspect - why adversarial methods yield better results: the adversarial component is essentially a trainable, 'smart' loss function for the (possibly variational) autoencoder.
Best Answer
A VAE models a distribution
$$P(x) = \int P(x|z)P(z) dz$$
When the output is continuous valued, then a common parameterization of $P(x|z)$ is as $$\mathcal{N}(\mu = f(z;\theta), \sigma^2)$$
Recall that a VAE is trained by maximizing the variational lower bound, which can be broken down into two terms:
$$E_{z \sim q(z|x)}[\log P(x|z)] - \text{KL}(q(z|x)\,\|\,P(z))$$
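For the common choice of a diagonal-Gaussian posterior and a standard-normal prior $P(z) = \mathcal{N}(0, I)$, the KL term has a closed form, while the first term is estimated by sampling. A sketch in NumPy (the encoder outputs, the input `x`, and the stand-in decoder `f` are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up encoder outputs for one input x
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, 0.2])
sigma = np.exp(0.5 * log_var)

# Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian
kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - log_var)

# Monte Carlo estimate of E_{z~q}[log P(x|z)] with one reparameterized sample,
# taking P(x|z) = N(f(z), 1), so log P(x|z) = -0.5*||x - f(z)||^2 + const
x = np.array([0.2, 0.7, 0.1])       # made-up input
W = rng.standard_normal((2, 3))     # made-up decoder weights
f = lambda z: np.tanh(z @ W)        # stand-in decoder
z = mu + sigma * rng.standard_normal(mu.shape)
log_px_given_z = -0.5 * np.sum((x - f(z)) ** 2)  # up to an additive constant

elbo = log_px_given_z - kl  # the quantity being maximized
```

Note that the KL term vanishes exactly when $\mu = 0$ and $\sigma = 1$, i.e. when the posterior matches the prior.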
But then the first term is merely the log of the Gaussian density, which is (up to scaling and additive constants) $-(x-\mu)^2$, so maximizing it amounts to minimizing the squared error. Hence MSE.
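To spell that step out: for fixed $\sigma$,

$$\log \mathcal{N}(x;\, \mu, \sigma^2) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2),$$

so with $\mu = f(z;\theta)$, maximizing $E_{z \sim q}[\log P(x|z)]$ is the same as minimizing $E_{z \sim q}\big[(x - f(z;\theta))^2\big]$, i.e. the mean squared error.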