Machine Learning – Why Decode Samples from $N(0,1)$ Instead of $\mu + \sigma N(0,1)$ in Variational Autoencoders?

autoencoders, generative-models, machine-learning, variational-bayes

Context: I'm trying to understand the use of variational autoencoders as generators. My understanding:

  • During training, for an input point $x_i$ we want to learn latent $\mu_i$ and $\sigma_i$ and then sample $z_i \sim N(\mu_i, \sigma_i)$ and feed it to the decoder to get a reconstruction $\hat{x}_i = \text{decode}(z_i)$.
  • But we can't backpropagate through the sampling operation, so instead we reparametrize and use $z_i = \mu_i + \sigma_i \epsilon$ where $\epsilon \sim N(0, 1)$. Our reconstruction becomes $\hat{x}_i = \text{decode}(\mu_i + \sigma_i \epsilon)$ (see the sketch right after this list).
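
A minimal PyTorch-style sketch of this training step, under assumptions not stated in the post: the encoder outputs $\mu_i$ and $\log \sigma_i^2$, the reconstruction term is a squared error, and `encoder`/`decoder` are placeholder modules.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, 1); sampling eps instead of z
    # keeps the path from mu and log_var to the loss differentiable.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_training_loss(encoder, decoder, x):
    # Hypothetical encoder returning per-example mu_i and log(sigma_i^2).
    mu, log_var = encoder(x)
    z = reparameterize(mu, log_var)              # z_i ~ N(mu_i, sigma_i)
    x_hat = decoder(z)                           # x_hat_i = decode(z_i)

    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, 1) ): pulls each per-example posterior
    # toward the standard-normal prior used later for generation.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```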

However, when we're done with training and ready to use the model as a generator, we sample $z \sim N(0, 1)$ and feed it to the decoder: $x_{sample} = \text{decode}(z)$.

The part that confuses me is that during training, the decode operation was done using $\mu_i + \sigma_i \epsilon$, which to my understanding means sampling from $N(\mu_i, \sigma_i)$ with a different $\mu_i$ and $\sigma_i$ for each training example. However, at generation time, the decode operation is done (effectively) on $\epsilon$ alone, drawn from $N(0, 1)$. Why are we setting $\mu = 0$ and $\sigma = 1$ during generation (i.e. using $z = 0 + 1 \cdot \epsilon$)?

Best Answer

During training, we are drawing $z \sim P(z|x)$ (the posterior over latent codes given the datapoint, approximated by the encoder's $N(\mu_i, \sigma_i)$), and then decoding with $\hat x = g(z)$.

During generation, we are drawing $z \sim P(z)$, and then decoding $x = g(z)$.

So this answers your question: during generation, we want to generate samples from the prior distribution of latent codes, $P(z) = N(0, 1)$, whereas during training, we are drawing samples from the (approximate) posterior, because we are trying to reconstruct a specific datapoint. The KL term in the training objective pushes each per-example posterior $N(\mu_i, \sigma_i)$ toward that standard-normal prior, so sampling $z \sim N(0, 1)$ at generation time produces latent codes of the kind the decoder has learned to handle.
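
As a concrete illustration (the `decoder` handle and `latent_dim` are placeholders, not from the answer), generation then reduces to sampling from the prior and decoding:

```python
import torch

@torch.no_grad()
def generate(decoder, n_samples, latent_dim):
    # No input x at generation time, hence no mu_i or sigma_i: sample
    # directly from the prior p(z) = N(0, I) that the KL term regularized
    # the per-example posteriors toward during training.
    z = torch.randn(n_samples, latent_dim)       # z ~ N(0, 1)
    return decoder(z)                            # x_sample = decode(z)
```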
