As shown in the figure, the encoder produces $\mu$ and $\sigma$, which are the mean and standard deviation of the posterior distribution $q(z|x)$.
The random effect comes from drawing samples from the posterior distribution. Each sample $z$ can be obtained by $z = \mu+\sigma\epsilon$, where $\epsilon$ is a sample from a Gaussian distribution with zero mean and unit variance, which can be easily obtained.
The reason why we need the randomness is, the reconstruction error term in the loss function is actually an expectation over the posterior distribution which is difficult to solve analytically, so we use sample mean to approximate it.
For the derivation of the regularization term you can refer to the original paper https://arxiv.org/abs/1312.6114
You seem to have misunderstood your architecture and are, quite simply, overfitting your data.
It looks like your interpretation of the latent space is that it represents a manifold of realistic-looking images. That is unlikely in the best case, and if your decoder performs any transformation (except perhaps an affine transformation) on the sampling outputs - impossible.
Autoencoders (or rather the encoder component of them) in general are compression algorithms. This means that they approximate 'real' data with a smaller set of more abstract features.
For example, a string '33333333000000000669111222222' could be losslessly compressed by a very simplistic algorithm to '8:3/9:0/2:6/1:9/3:1/6:2' - occurences:number, maintaining position. If your criterion was length of text, the encoding is six characters shorter - not a huge improvement, but an improvement nonetheless.
What happened was we've introduced an abstract, higher-dimensional feature - 'number of repetitions' - that helps us express the original data more tersely. You could compress the output further; for example, noticing that even positions are just separators, you could encode them as a single-bit padding rather than an ASCII code.
Autoencoders do exactly that, except they get to pick the features themselves, and variational autoencoders enforce that the final level of coding (at least) is fuzzy in a way that can be manipulated.
So what you do in your model is that you're describing your input image using over sixty-five thousand features. And in a variational autoencoder, each feature is actually a sliding scale between two distinct versions of a feature, e.g. male/female for faces, or wide/thin brushstroke for MNIST digits.
Can you think of just a hundred ways to describe the differences between two realistic pictures in a meaningful way? Possible, I suppose, but they'll get increasingly forced as you try to go on.
With so much room to spare, the optimizer can comfortably encode each distinct training image's features in a non-overlapping slice of the latent space rather than learning the features of the training data taken globally.
So, when you feed it a validation picture, its encoding lands somewhere between islands of locally applicable feature encodings and so the result is entirely incoherent.
Best Answer
Yes, it has been done. The following paper implements something of that form:
Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders. Nat Dilokthanakul, Pedro A.M. Mediano, Marta Garnelo, Matthew C.H. Lee, Hugh Salimbeni, Kai Arulkumaran, Murray Shanahan.
They experiment with using this approach for clustering. Each Gaussian in the Gaussian mixture corresponds to a different cluster. Because the Gaussian mixture is in the latent space ($z$), and there is a neural network connecting $z$ to $x$, this allows non-trivial clusters in the input space ($x$).
That paper also mentions the following blog post, which experiments with a different variation on that architecture: http://ruishu.io/2016/12/25/gmvae/
Thanks to shimao for pointing this out.