I have been reading the original VAE paper, Auto-Encoding Variational Bayes. In a VAE, when generating samples, why do we sample from the prior instead of the learned variational posterior (Fig. 5 in the paper)? Similarly, when visualizing the learned data manifold, why do we take the inverse CDF of the prior (a multivariate unit Gaussian) instead of the variational posterior (Fig. 4)?
Solved – Variational Auto Encoder (VAE) sampling from prior vs posterior
neural-networks, variational-bayes
Related Solutions
Good priors depend on your actual problem - in particular, I don't believe there are any truly universal defaults. One good way is to try to formulate (possibly weak and vague) domain-specific knowledge about the process that generated your data, e.g.:
- "It's highly unlikely to have more than 12 components"
- "It's highly unlikely to observe values larger than 80"
Note that those should generally not be informed by the actual data you collected, but by what you would be able to say before gathering the data (e.g., the data represent outdoor temperatures in Celsius, therefore they will very likely lie in $[-50,80]$ even before looking at the data). It is also OK to motivate your priors by the computational machinery you use (e.g., I will collect 100 data points, hence I can safely assume it is unlikely to have more than 10 components, since I won't have enough data to locate more components anyway).
Some of those statements can be translated directly into priors - e.g. you can set $m_0$ and $W_0^{-1}$ so that 95% of the prior mass is over the expected range of values.
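For instance, continuing the temperature example above (with purely illustrative numbers), you might center a Normal prior on the midpoint of the plausible range and pick its scale so that the range covers about 95% of the prior mass:

$$m_0 = \frac{-50 + 80}{2} = 15, \qquad \sigma_0 \approx \frac{80 - (-50)}{2 \times 1.96} \approx 33,$$

so that $m_0 \pm 1.96\,\sigma_0 \approx [-50, 80]$. The same back-of-the-envelope reasoning can be applied to scale hyperparameters such as $W_0^{-1}$.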
For the less intuitive parameters (or just as another robustness check), you can follow the Visualization in Bayesian workflow paper and do prior predictive checks: simulate a large number of new datasets starting from your prior (a minimal sketch follows this list). You can then visualize them to see whether they
- don't violate your expectations too often (it is good to leave some room for surprises, hence aiming for something like 90% or 95% of simulations within your constraints)
- otherwise cover the whole spectrum of values reasonably well
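Here is a minimal sketch of such a prior predictive check, re-using the illustrative numbers from the worked example above and assuming a plain Gaussian model for each dataset; the specific prior family and values are placeholders, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)

m0, s0 = 15.0, 33.0          # prior on the mean: ~95% of N(15, 33^2) lies in [-50, 80]
sigma = 5.0                  # assumed within-dataset spread (also a prior guess)
n_sims, n_obs = 1000, 100    # number of simulated datasets, observations per dataset

ok = 0
for _ in range(n_sims):
    mu = rng.normal(m0, s0)                  # draw parameters from the prior
    y = rng.normal(mu, sigma, size=n_obs)    # simulate a dataset from those parameters
    ok += np.all((y > -50) & (y < 80))       # does it respect the stated constraint?

print(f"{ok / n_sims:.0%} of simulated datasets stay within [-50, 80] C")
# Aim for roughly 90-95%: high enough to respect domain knowledge,
# low enough to leave room for surprises.
```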
Because a VAE is really a graphical model more than it is an autoencoder, it can be confusing what exactly "reconstruction" means in this context.
Recall that we have a lower bound on the log probability given by the RHS of:
$$\log p(x) - D_{KL}(Q(z|x)||P(z|x)) = E_{z\sim Q}[\log p(x|z)] - D_{KL}(Q(z|x)||P(z))$$
Computing this lower bound (a necessary prerequisite for a backprop pass to maximize it) corresponds to passing $x$ through the encoder to arrive at $Q(z|x)$, computing the KL divergence, and then estimating $E_{z \sim Q}[\log p(x|z)]$ by sampling once or more (but usually just once) from $Q$ and running the decoder.
This process of estimating the posterior with the encoder and then sampling to approximate the expectation on the RHS so closely mimics the computation an autoencoder would do that we call it "reconstruction". However, it's really just a side effect of trying to maximize the log probability of the inputs.
What happens when you sample multiple times from $Q$? The immediate consequence is that you get a better approximation of the expectation, and hence a better approximation of the lower bound on the log probability. You also need to run the decoder multiple times, which can be expensive, so it is usually not done. Of course if you do this, then you end up with many reconstructions rather than just one. Note that it is definitely not possible to average the reconstructions and have a meaningful output.
So you probably just want to sample once.
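Schematically, here is a minimal NumPy sketch of this estimate, with hypothetical `encode`/`decode` callables standing in for the encoder and decoder networks, assuming a diagonal-Gaussian $Q(z|x)$ and a unit-variance Gaussian decoder so that $\log p(x|z) = -\tfrac{1}{2}||x - \text{decode}(z)||_2^2$ up to a constant:

```python
import numpy as np

def elbo_estimate(x, encode, decode, n_samples=1, rng=np.random.default_rng(0)):
    """Stochastic estimate of the lower bound on log p(x) for a single input x."""
    mu, logvar = encode(x)                       # parameters of Q(z|x)
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    # Monte Carlo estimate of E_{z~Q}[log p(x|z)] via the reparameterisation trick
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps      # z ~ Q(z|x)
        recon += -0.5 * np.sum((x - decode(z))**2)   # log p(x|z) up to a constant
    recon /= n_samples
    return recon - kl                            # lower bound on log p(x)
```

With `n_samples=1` this is exactly the usual single-sample "reconstruction plus KL" pass; increasing `n_samples` only tightens the Monte Carlo approximation at the cost of extra decoder evaluations.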
In response to your edit, the correct way to write it would be:
$$\begin{align*} E_{z \sim Q}[\log p(x|z)] &\approx \frac{1}{N}\sum_{i=1}^N \log p(x|z_i) \\ &\propto -\frac{1}{N}\sum_i ||x-\text{decode}(z_i)||_2^2 \\ &\propto -\frac{1}{Nm} \sum_i \sum_j (x_j - \text{decode}(z_i)_j)^2\end{align*}$$
We would expect the reconstructions $\text{decode}(z_i)$ to look quite similar to each other, but not exactly the same. Exactly how similar depends on the nature of the data and how well the model is fitted.
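For concreteness, here is a toy illustration of drawing several $z_i$ from $Q(z|x)$ and comparing the resulting reconstructions; the fake input, pretend encoder outputs, and linear "decoder" are made-up stand-ins so the snippet runs, and the last line is the Monte Carlo version of the right-hand side above.

```python
import numpy as np

# Toy stand-ins (not from the paper), only so the snippet runs end to end.
rng = np.random.default_rng(0)
x = rng.standard_normal(784)                       # a fake input
mu, logvar = rng.standard_normal(2), np.zeros(2)   # pretend Q(z|x) parameters for x
W = rng.standard_normal((784, 2))
decode = lambda z: W @ z                           # pretend decoder

N = 10
zs = [mu + np.exp(0.5 * logvar) * rng.standard_normal(2) for _ in range(N)]
recons = [decode(z) for z in zs]                   # one reconstruction per z_i
mc_term = -np.mean([np.mean((x - r) ** 2) for r in recons])   # -(1/Nm) sum_i sum_j (x_j - decode(z_i)_j)^2
```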
Best Answer
Here's what I understood about VAEs:
The posterior describes how likely a latent code is given the input, while the prior simply describes how the latent variables are distributed without any conditioning (in CVAEs, the conditioning variables enter here as well).
Hence, during training we want to learn a good approximation to the posterior that explains the input (this is what maximizes the evidence lower bound), but at test time we want to generate new samples, so we draw the latent code from the prior (unless you want to condition it somehow, as in a CVAE). This works because the KL term in the training objective pushes $Q(z|x)$ towards the prior $P(z)$, and when generating from scratch there is no input $x$ to condition on. The same logic explains Fig. 4: the inverse CDF of the unit Gaussian maps an evenly spaced grid in the unit square onto latent space in proportion to the prior's mass.
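To make the two procedures from the question concrete, here is a minimal sketch; the dummy linear decoder, `latent_dim`, and grid size are illustrative placeholders standing in for the trained decoder network and your own choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 784
W = rng.standard_normal((data_dim, latent_dim))
decode = lambda Z: Z @ W.T                       # stand-in for the trained decoder

# (1) Generating samples (Fig. 5): draw z from the prior N(0, I) and decode.
z = rng.standard_normal((16, latent_dim))
samples = decode(z)

# (2) Visualising the learned manifold (Fig. 4): take an evenly spaced grid in
# (0, 1)^2, push it through the inverse CDF of the unit Gaussian so it covers
# the prior's mass, and decode each grid point.
grid = np.linspace(0.05, 0.95, 20)
zs = np.array([[norm.ppf(a), norm.ppf(b)] for a in grid for b in grid])
manifold = decode(zs)
```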