Solved – Help Understanding Reconstruction Loss In Variational Autoencoder

autoencoders, neural networks

The reconstruction loss for a VAE (see, for example, equation 20.77 in the Deep Learning book) is often written as

$-\mathbb{E}_{z\sim q(z \mid x)} \log p_{\text{model}}(x \mid z),$

where $z$ represents latent variables, $x$ represents an image, and $q$ is an approximate encoder. If we assume that

$x | z \sim \mathcal{N}(\mu_{x|z}, \Sigma_{x|z})$,

then I understand how minimizing the reconstruction loss is just a typical maximum likelihood problem.

But when I see VAEs implemented in practice, people often seem to represent the reconstruction loss as the L2 loss between each $x$ (training image) and $\hat{x}$ (decoded image).

Can anyone explain how to translate the reconstruction loss in the first expression above into the L2 loss between $x$ and $\hat{x}$?
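
For concreteness, here is a minimal sketch of the kind of reconstruction term I mean in practice (PyTorch; the tensors are made-up stand-ins for a training batch and the decoder output, not a real model):

```python
import torch

# Made-up batch standing in for flattened 28x28 training images, and a second
# tensor standing in for the decoder's output x_hat = decoder(z), with
# z sampled from the approximate posterior q(z | x).
x = torch.rand(16, 784)
x_hat = torch.rand(16, 784)

# The "reconstruction loss" I usually see: squared error between x and x_hat,
# summed over pixels and averaged over the batch.
recon_loss = ((x - x_hat) ** 2).sum(dim=1).mean()
print(recon_loss)
```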

Best Answer

Typically in VAE implementations, the output of the decoder is actually the mean $\mu_{x|z}$, which I will just call $\boldsymbol{\mu}$, and people assume an identity covariance, $\Sigma = I$. In that case we have

$\log p(x \mid z) = -\frac{1}{2}\left[\log|\Sigma| + k\log(2\pi) + (\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{x}-\boldsymbol{\mu})\right],$

where $k$ is the dimensionality of $x$.

This comes from taking the log of the pdf of a multivariate Gaussian distribution with $\Sigma = I$ (so $\log|\Sigma| = 0$ and no $\Sigma^{-1}$ is needed in the quadratic form). Since the first two terms do not depend on $\boldsymbol{\mu}$, the optimization problem is equivalent to maximizing $-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{x}-\boldsymbol{\mu})$, which is just minimizing the L2 loss between $\mathbf{x}$ and $\boldsymbol{\mu}$ (the decoded image $\hat{x}$). Finally, the expectation over $z \sim q(z \mid x)$ is approximated by Monte Carlo averaging, often with a single sample of $z$ per data point.
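
To make the equivalence concrete, here is a small numerical sketch (NumPy; the image $x$ and the mean $\boldsymbol{\mu}$ are random placeholders) showing that, with $\Sigma = I$, $\log p(x \mid z)$ and the negative L2 loss differ only by a constant that does not depend on $\boldsymbol{\mu}$:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 784                      # dimensionality of x
x = rng.random(k)            # placeholder "training image"
mu = rng.random(k)           # placeholder decoder mean for some sampled z

# log p(x | z) under N(mu, I):  -0.5 * [log|I| + k*log(2*pi) + (x-mu)^T (x-mu)]
log_p = -0.5 * (k * np.log(2 * np.pi) + np.sum((x - mu) ** 2))

# Negative (half) squared error between x and mu.
neg_half_l2 = -0.5 * np.sum((x - mu) ** 2)

# The difference is a constant independent of mu, so maximizing log p(x | z)
# over the decoder output is the same as minimizing the L2 loss.
print(log_p - neg_half_l2)            # equals -0.5 * k * log(2*pi)
print(-0.5 * k * np.log(2 * np.pi))
```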