My answer to this question would be the following:
A generative model, as defined in the Wikipedia link you provided, aims to estimate the joint distribution of your data and some latent (random) variables, usually $p(\textbf{x},\textbf{z})$. In the case of the VAE, the data (usually $\textbf{x}$) are the images, text, audio, or whatever you are modeling, and the latent variable (usually $\textbf{z}$) is multivariate normal (an assumption you can relax). In the AE you cannot make this analogy: you take your data, map it to a space of smaller dimension than the original, and try to decode this lower-dimensional representation back into the original data. There are no distributional assumptions about how your data are generated; in probabilistic-reasoning lingo, there are no assumptions on the data-generating process.
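To make the contrast concrete, here is a deterministic autoencoder in miniature: a fixed linear "encoder" projects 2-D data onto a 1-D code and a "decoder" maps it back. The weights are hand-picked assumptions purely for illustration; note there is no distribution placed over the code.

```python
# A toy deterministic AE: encode 2-D data to a 1-D code, then decode.
# Weights are illustrative assumptions; no distribution over the code z.
def encode(x):
    # project onto the direction (1, 1) / 2 -> a single scalar code
    return 0.5 * (x[0] + x[1])

def decode(z):
    # map the scalar code back to 2-D
    return [z, z]

x = [3.0, 5.0]
z = encode(x)       # lower-dimensional code, a point, not a distribution
x_hat = decode(z)   # reconstruction of the original data
print(z, x_hat)     # 4.0 [4.0, 4.0]
```

The VAE replaces the point code $z$ with a distribution $q(\textbf{z}|\textbf{x})$, which is exactly what the plain AE lacks.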
When I started studying VAEs I thought of them as "just" probabilistic AEs, but now I really don't like that way of looking at them. The intuition I have built around VAEs and the use of neural networks is the following: you build your model on a data-generating process in which there is one latent variable per observation. Estimating each latent variable separately can be extremely expensive in classical variational inference, so you use a function approximator (this is where the neural networks come in) to approximate the distribution of each latent variable from the observation itself. The use of neural networks in probabilistic reasoning thus comes from their approximating capabilities, contrary to the view that VAEs are just probabilistic extensions of autoencoders.
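A minimal sketch of that amortization idea, in plain Python: instead of optimizing a separate variational distribution per observation, one shared "encoder" maps every $x$ to the parameters of its $q(z|x)$. The linear weights here are arbitrary assumptions standing in for a trained network.

```python
import math
import random

random.seed(0)

# Assumed (untrained) weights for the mean and log-variance heads.
W_MU, B_MU = 0.5, 0.1
W_LV, B_LV = -0.3, -1.0

def encode(x):
    """Return (mu, sigma) of the Gaussian q(z|x) for one scalar x."""
    mu = W_MU * x + B_MU
    log_var = W_LV * x + B_LV
    return mu, math.exp(0.5 * log_var)

def sample_z(x):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1)."""
    mu, sigma = encode(x)
    eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps

# The same function handles every observation -- this is amortized inference.
for x in [0.0, 1.0, 2.0]:
    mu, sigma = encode(x)
    print(f"x={x}: q(z|x) = N({mu:.2f}, {sigma:.2f}^2), sample z = {sample_z(x):.2f}")
```

The point is that `encode` is one function reused across the whole dataset, rather than one set of variational parameters per data point.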
Similarly, other models have been developed around the same intuition I tried to explain: for example, deep Kalman filters, structured VAEs, etc.
EDIT:
Note that my definition of a generative model is a bit reductionist. There is a family of models, called autoregressive generative models, that do not include a latent variable. In that case, the joint distribution of your variables factorizes into individual distributions, each conditional on the preceding variables. Mathematically:
\begin{align}
p(\textbf{x}) &= p(x_{0}, x_{1}, \dots, x_{N}) \\
&= \prod_{i=0}^{N} p(x_{i} \mid \textbf{x}_{<i}) \\
&= p(x_{N} \mid x_{N-1}, x_{N-2}, \dots, x_{0}) \, p(x_{N-1} \mid x_{N-2}, x_{N-3}, \dots, x_{0}) \cdots p(x_{0})
\end{align}
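The chain-rule factorization above is easy to compute directly. Below is a toy autoregressive model over binary sequences where each conditional $p(x_i \mid \textbf{x}_{<i})$ is a hand-set Bernoulli depending only on the previous bit; the probabilities are illustrative assumptions, not from any trained model.

```python
import math

def cond_prob(x_i, prev):
    """p(x_i | previous bit): a hypothetical hand-set Bernoulli."""
    # p(x_i = 1): 0.5 at the start, 0.7 after a 1, 0.2 after a 0
    p_one = 0.5 if prev is None else (0.7 if prev == 1 else 0.2)
    return p_one if x_i == 1 else 1.0 - p_one

def joint_log_prob(x):
    """log p(x) = sum_i log p(x_i | x_{<i}) -- the chain-rule factorization."""
    total = 0.0
    prev = None
    for x_i in x:
        total += math.log(cond_prob(x_i, prev))
        prev = x_i
    return total

seq = [1, 1, 0]
print(joint_log_prob(seq))  # log(0.5) + log(0.7) + log(0.3)
```

No latent $\textbf{z}$ appears anywhere: the model is generative purely through these conditionals.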
Best Answer
I think attention can help. Please refer to this answer.
There are many ways to incorporate attention into an autoencoder. The simplest is to borrow the architecture from BERT but make the middle layers thinner, so that they act as a bottleneck.
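For reference, here is the core mechanism such a model would reuse: scaled dot-product self-attention, written in plain Python. The token features and sizes are illustrative assumptions; a BERT-style autoencoder would stack several of these layers with learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: each row of Q attends over rows of K
    and returns the corresponding weighted sum of rows of V."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy self-attention: three tokens with 2-D features (Q = K = V = X).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(X, X, X))
```

Making the "middle layers thinner" would amount to projecting these outputs down to a lower-dimensional bottleneck before decoding.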