You seem to have misunderstood your architecture and are, quite simply, overfitting your data.
It looks like your interpretation of the latent space is that it represents a manifold of realistic-looking images. That is unlikely at best, and if your decoder performs any transformation (except perhaps an affine one) on the sampled values - impossible.
Autoencoders (or rather the encoder component of them) in general are compression algorithms. This means that they approximate 'real' data with a smaller set of more abstract features.
For example, a string '33333333000000000669111222222' could be losslessly compressed by a very simplistic algorithm to '8:3/9:0/2:6/1:9/3:1/6:2' - occurrences:number, maintaining position. If your criterion were length of text, the encoding is six characters shorter - not a huge improvement, but an improvement nonetheless.
What happened is that we introduced an abstract, higher-level feature - 'number of repetitions' - that helps us express the original data more tersely. You could compress the output further; for example, noticing that every second character is just a separator, you could encode the separators as single-bit padding rather than full ASCII codes.
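The run-length idea above can be sketched in a few lines of Python (the helper name `rle_encode` is mine, not a standard library function):

```python
def rle_encode(s: str) -> str:
    """Compress a string into 'count:char' pairs separated by '/'."""
    pairs = []
    i = 0
    while i < len(s):
        j = i
        # Scan to the end of the current run of identical characters.
        while j < len(s) and s[j] == s[i]:
            j += 1
        pairs.append(f"{j - i}:{s[i]}")
        i = j
    return "/".join(pairs)

original = "33333333000000000669111222222"
encoded = rle_encode(original)
print(encoded)                        # 8:3/9:0/2:6/1:9/3:1/6:2
print(len(original) - len(encoded))   # 6 characters saved
```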
Autoencoders do exactly that, except they get to pick the features themselves, and variational autoencoders enforce that the final level of coding (at least) is fuzzy in a way that can be manipulated.
So what your model is doing is describing your input image using over sixty-five thousand features. And in a variational autoencoder, each feature is actually a sliding scale between two distinct versions of a feature, e.g. male/female for faces, or wide/thin brushstrokes for MNIST digits.
Can you think of just a hundred ways to describe the differences between two realistic pictures in a meaningful way? Possible, I suppose, but they'll get increasingly forced as you try to go on.
With so much room to spare, the optimizer can comfortably encode each distinct training image's features in a non-overlapping slice of the latent space rather than learning the features of the training data taken globally.
So, when you feed it a validation picture, its encoding lands somewhere between islands of locally applicable feature encodings and so the result is entirely incoherent.
My answer to this question would be the following:
A generative model, as defined in the Wikipedia link you provided, aims to estimate the joint distribution of your data and some latent (random) variables, usually $p(\textbf{x},\textbf{z})$. In the case of the VAE, the data (usually $\textbf{x}$) are the images, text, audio, or whatever you are modeling, and the latent variable (usually $\textbf{z}$) is multivariate normal (you can relax this). In the AE you cannot make this analogy: you have your data, you map it to a space of lower dimension than your original image, and you try to decode this lower-dimensional representation back into your original data. This means there are no distributional assumptions about how your data is generated. In probabilistic-reasoning lingo, there are no assumptions on the data generation process.
When I started studying VAEs I thought of them as "just" probabilistic AEs, but now I really don't like that way of looking at them. The intuition I have built around VAEs and the use of neural networks is the following: you build your model on a data generation process in which there is a latent variable per observation. Estimating each latent variable per observation can be extremely expensive in classical variational inference, so you use a function approximator (this is where the neural networks come in) and approximate the distribution of each latent variable from the observation itself. So neural networks enter probabilistic reasoning because of their approximating capabilities, rather than VAEs being just probabilistic extensions of neural networks.
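The amortization idea can be sketched with NumPy. The tiny linear "encoder" and its random weights below are hypothetical stand-ins for a real network; the point is that one shared function predicts each observation's posterior parameters, instead of optimizing them per observation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny "encoder": maps an observation x to the parameters
# (mu, log_var) of its approximate posterior q(z|x). Classical VI would
# fit mu and log_var separately for every observation; amortized
# inference learns one function that predicts them from x directly.
W_mu = rng.normal(size=(2, 4))
W_logvar = rng.normal(size=(2, 4))

def encode(x):
    return W_mu @ x, W_logvar @ x   # mu(x), log sigma^2(x)

def sample_z(x):
    mu, log_var = encode(x)
    eps = rng.normal(size=mu.shape)          # reparameterization trick:
    return mu + np.exp(0.5 * log_var) * eps  # z = mu + sigma * eps

x = rng.normal(size=4)   # one toy observation
z = sample_z(x)
print(z.shape)           # (2,) - a sample from q(z|x)
```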
Similarly, other models have been developed around the same intuition I tried to explain, for example deep Kalman filters, structured VAEs, etc.
EDIT:
Note that my definition of generative model is a bit reductionist. There is a family of models called "auto-regressive" generative models that don't include a latent variable. In this case, you would be looking at a joint distribution of your variables as a factorization of the individual distributions conditional on the rest of the (previous) variables. Mathematically:
\begin{align}
p(\textbf{x}) &= p(x_{0}, x_{1},...,x_{N}) \\
&= \prod_{i=0}^{N}p(x_{i}|\textbf{x}_{<i}) \\
&= p(x_{N}|x_{N-1}, x_{N-2},...,x_{0})\,p(x_{N-1}|x_{N-2}, x_{N-3},...,x_{0})\cdots p(x_{0})
\end{align}
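The chain-rule factorization above can be verified numerically on a toy joint distribution over three binary variables (the probability table below is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # normalized table for p(x0, x1, x2)

def p(x0, x1, x2):
    return joint[x0, x1, x2]

def p_factorized(x0, x1, x2):
    # p(x0) * p(x1|x0) * p(x2|x0,x1), each factor from marginals of the table
    p_x0 = joint.sum(axis=(1, 2))[x0]
    p_x1_given_x0 = joint.sum(axis=2)[x0, x1] / joint.sum(axis=(1, 2))[x0]
    p_x2_given_x01 = joint[x0, x1, x2] / joint.sum(axis=2)[x0, x1]
    return p_x0 * p_x1_given_x0 * p_x2_given_x01

for idx in np.ndindex(2, 2, 2):
    assert np.isclose(p(*idx), p_factorized(*idx))
print("chain rule holds")
```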
Best Answer
Question 1:
The output of the decoder aims to model the distribution $p(x|t)$, i.e. the distribution of data $x$ given the latent variable $t$. Therefore, in principle, it should always be probabilistic.
However, in some cases, people simply use the mean squared error as the loss and, as you said, the output of the decoder is the actual predicted data points. Note that this approach can also be viewed as probabilistic, in the sense that it is equivalent to modeling $p(x|t)$ as Gaussian with identity covariance, $p(x|t) = \mathcal{N}(x|\mu(t), I)$. In this case, the output of the decoder is the mean $\mu(t)$ and, therefore, for an example $x_i$, you get the following reconstruction loss:
\begin{align} -\log p(x_i | t_i) &= -\log \left(\frac{1}{\sqrt{(2\pi)^k |I|}} \exp \left(-\frac{1}{2}(x_i-\mu(t_i))^\intercal I^{-1} (x_i-\mu(t_i))\right)\right) \\ &= \frac{1}{2}||x_i - \mu(t_i)||^2 + \text{const.} \end{align}
which, as you can see, is proportional to the mean squared error (plus some constant).
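This equivalence is easy to check numerically: the Gaussian negative log-likelihood with identity covariance differs from half the squared error only by a constant independent of $\mu$ (the toy vectors below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
x = rng.normal(size=k)    # a toy data point
mu = rng.normal(size=k)   # a toy decoder mean

# -log N(x | mu, I) written out explicitly
neg_log_p = 0.5 * k * np.log(2 * np.pi) + 0.5 * np.sum((x - mu) ** 2)

half_sq_error = 0.5 * np.sum((x - mu) ** 2)
const = 0.5 * k * np.log(2 * np.pi)   # does not depend on mu

assert np.isclose(neg_log_p, half_sq_error + const)
print("Gaussian NLL = 0.5 * squared error + const")
```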
Question 2:
This is not quite true. The correct statement would be that the Bernoulli distribution makes sense for black and white (i.e. binary) images. The Bernoulli distribution is binary, so it assumes that observations may only have two possible outcomes. It is true that people sometimes use it for grayscale images, but this is an abuse of the VAE's assumptions. It may work pretty well for datasets that are almost black and white, like MNIST. However, a binarized version of the MNIST dataset exists and, strictly speaking, this is the version that should be used together with a Bernoulli VAE.
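The Bernoulli reconstruction loss is just binary cross-entropy. A minimal sketch (the helper name and toy arrays are mine) shows why binary targets are the natural fit - the grayscale case treats $x \in [0,1]$ as if it were a Bernoulli outcome:

```python
import numpy as np

def bernoulli_nll(x, p):
    """Negative log-likelihood of targets x under Bernoulli(p)."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

x_binary = np.array([0., 1., 1., 0.])    # binarized pixels
p_decoder = np.array([.1, .9, .8, .2])   # decoder outputs (sigmoid probs)
print(bernoulli_nll(x_binary, p_decoder))
```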
I would try a Gaussian first, $p(x|t) = \mathcal{N}(x|\mu(t), \sigma^2(t))$, so the decoder would output two values, $\mu(t)$ and $\sigma^2(t)$. But yeah, if you really want a t-distribution, then that is the way to go.
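A sketch of the resulting per-pixel Gaussian reconstruction loss, where the two arrays stand in for the decoder's mean and variance output heads (all names and values here are hypothetical):

```python
import numpy as np

def gaussian_nll(x, mu, var):
    """Per-pixel -log N(x | mu, diag(var)), summed over pixels."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

rng = np.random.default_rng(0)
x = rng.random(10)          # "pixels" of one image
mu = rng.random(10)         # decoder mean head, mu(t)
var = rng.random(10) + 0.1  # decoder variance head, sigma^2(t), kept positive
print(gaussian_nll(x, mu, var))
```

With `var` fixed to 1 this reduces to half the squared error plus a constant, recovering the MSE case from Question 1.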