The encoder distribution is $q(z|x)=\mathcal{N}(z|\mu(x),\Sigma(x))$ where $\Sigma=\text{diag}(\sigma_1^2,\ldots,\sigma_n^2)$.
The latent prior is given by $p(z)=\mathcal{N}(0,I)$.
Both are $n$-dimensional multivariate Gaussians, for which the KL divergence is in general:
$$
\mathfrak{D}_\text{KL}[p_1\mid\mid p_2] =
\frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - n + \text{tr} \{ \Sigma_2^{-1}\Sigma_1 \} + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1)\right]
$$
where $p_1 = \mathcal{N}(\mu_1,\Sigma_1)$ and $p_2 = \mathcal{N}(\mu_2,\Sigma_2)$.
In the VAE case, $p_1 = q(z|x)$ and $p_2=p(z)$, so $\mu_1=\mu$, $\Sigma_1 = \Sigma$, $\mu_2=\vec{0}$, $\Sigma_2=I$. Thus:
\begin{align}
\mathfrak{D}_\text{KL}[q(z|x)\mid\mid p(z)]
&=
\frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - n + \text{tr} \{ \Sigma_2^{-1}\Sigma_1 \} + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1)\right]\\
&= \frac{1}{2}\left[\log\frac{|I|}{|\Sigma|} - n + \text{tr} \{ I^{-1}\Sigma \} + (\vec{0} - \mu)^T I^{-1}(\vec{0} - \mu)\right]\\
&= \frac{1}{2}\left[-\log{|\Sigma|} - n + \text{tr} \{ \Sigma \} + \mu^T \mu\right]\\
&= \frac{1}{2}\left[-\log\prod_i\sigma_i^2 - n + \sum_i\sigma_i^2 + \sum_i\mu^2_i\right]\\
&= \frac{1}{2}\left[-\sum_i\log\sigma_i^2 - n + \sum_i\sigma_i^2 + \sum_i\mu^2_i\right]\\
&= \frac{1}{2}\left[-\sum_i\left(\log\sigma_i^2 + 1\right) + \sum_i\sigma_i^2 + \sum_i\mu^2_i\right]\\
\end{align}
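As a quick numerical sanity check of the closed form on the last line, we can compare it against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$ (a sketch; the variable names and the particular random test values are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
mu = rng.normal(size=n)            # encoder mean mu(x)
sigma2 = rng.uniform(0.5, 2.0, n)  # encoder variances sigma_i^2

# Closed form from the derivation above:
# D_KL[q || p] = 0.5 * [ -sum_i(log sigma_i^2 + 1) + sum_i sigma_i^2 + sum_i mu_i^2 ]
kl_closed = 0.5 * (-(np.log(sigma2) + 1).sum() + sigma2.sum() + (mu ** 2).sum())

# Monte Carlo estimate: average of log q(z) - log p(z) over samples z ~ q(z|x)
z = mu + np.sqrt(sigma2) * rng.normal(size=(200_000, n))
log_q = -0.5 * (np.log(2 * np.pi * sigma2) + (z - mu) ** 2 / sigma2).sum(axis=1)
log_p = -0.5 * (np.log(2 * np.pi) + z ** 2).sum(axis=1)
kl_mc = (log_q - log_p).mean()

print(kl_closed, kl_mc)  # the two estimates should agree closely
```

With a couple hundred thousand samples the two numbers typically match to two or three decimal places.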
You seem to have misunderstood your architecture and are, quite simply, overfitting your data.
It looks like your interpretation of the latent space is that it represents a manifold of realistic-looking images. That is unlikely in the best case, and if your decoder performs any transformation on the sampled outputs (except perhaps an affine one), it is impossible.
Autoencoders (or rather the encoder component of them) in general are compression algorithms. This means that they approximate 'real' data with a smaller set of more abstract features.
For example, the string '33333333000000000669111222222' could be losslessly compressed by a very simplistic algorithm to '8:3/9:0/2:6/1:9/3:1/6:2' (occurrences:number, maintaining order). If your criterion were length of text, the encoding is six characters shorter: not a huge improvement, but an improvement nonetheless.
What happened is that we've introduced an abstract, higher-level feature, 'number of repetitions', that helps us express the original data more tersely. You could compress the output further; for example, noticing that even positions are just separators, you could encode them as a single-bit padding rather than an ASCII code.
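This simplistic scheme is just run-length encoding; a minimal sketch reproducing the example above (function names are mine):

```python
from itertools import groupby

def rle_encode(s: str) -> str:
    # Emit 'occurrences:character' pairs joined by '/', preserving run order
    return "/".join(f"{len(list(g))}:{ch}" for ch, g in groupby(s))

def rle_decode(enc: str) -> str:
    # Invert the encoding: repeat each character by its stored count
    return "".join(ch * int(n) for n, ch in (p.split(":") for p in enc.split("/")))

original = "33333333000000000669111222222"
encoded = rle_encode(original)
print(encoded)                       # 8:3/9:0/2:6/1:9/3:1/6:2
print(len(original) - len(encoded))  # 6 characters saved
assert rle_decode(encoded) == original  # lossless round trip
```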
Autoencoders do exactly that, except they get to pick the features themselves, and variational autoencoders enforce that the final level of coding (at least) is fuzzy in a way that can be manipulated.
So what your model does is describe the input image using over sixty-five thousand features. And in a variational autoencoder, each feature is effectively a sliding scale between two distinct versions of a feature, e.g. male/female for faces, or wide/thin brushstrokes for MNIST digits.
Can you think of just a hundred ways to describe the differences between two realistic pictures in a meaningful way? Possible, I suppose, but they'll get increasingly forced as you try to go on.
With so much room to spare, the optimizer can comfortably encode each distinct training image's features in a non-overlapping slice of the latent space rather than learning the features of the training data taken globally.
So, when you feed it a validation picture, its encoding lands somewhere between those islands of locally applicable feature encodings, and the decoded result is entirely incoherent.
Best Answer
Typically in VAE implementations, the output of the decoder is actually the mean $\boldsymbol{\mu}_{x|z}$, which I will just call $\boldsymbol{\mu}$, and one assumes a unit (identity) covariance, $\Sigma = I$. So in that case we have: $\log p(x|z)=-\frac{1}{2}\left[\log|\Sigma|+k\log(2\pi)+(\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{x}-\boldsymbol{\mu})\right]$ where $k$ is the dimensionality of $\mathbf{x}$ (and $\log|\Sigma|=0$ since $\Sigma=I$).
This comes from taking the log of the pdf of a multivariate Gaussian distribution. Since the first two terms are constant with respect to $\boldsymbol{\mu}$, the optimization problem is equivalent to maximizing $-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{x}-\boldsymbol{\mu})$, which is just the negative of the squared L2 loss between $\mathbf{x}$ and $\boldsymbol{\mu}$. Finally, the expectation over $z$ is approximated by Monte Carlo averaging over samples from $q(z|x)$.
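To see the equivalence numerically, here is a small sketch (variable and function names are mine) checking that the negative Gaussian log-likelihood with identity covariance and the squared L2 loss differ only by a constant independent of $\boldsymbol{\mu}$, so they share minimizers and gradients:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5                      # dimensionality of the data x
x = rng.normal(size=k)     # a fixed data point

def neg_log_lik(mu):
    # -log N(x; mu, I) = 0.5 * [k*log(2*pi) + ||x - mu||^2]
    return 0.5 * (k * np.log(2 * np.pi) + ((x - mu) ** 2).sum())

def l2_loss(mu):
    # Squared L2 reconstruction loss
    return 0.5 * ((x - mu) ** 2).sum()

# The difference is the same constant for every mu: 0.5 * k * log(2*pi)
mus = [rng.normal(size=k) for _ in range(3)]
diffs = [neg_log_lik(m) - l2_loss(m) for m in mus]
print(diffs)  # three identical values, 0.5 * k * log(2*pi)
```

Because the offset does not depend on $\boldsymbol{\mu}$, minimizing either objective trains the decoder identically.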