Solved – On evaluating variational autoencoders with prior likelihood and reconstruction error

autoencoders, generative-models, machine-learning, neural-networks, variational-bayes

A common evaluation metric for variational autoencoders (VAEs) is the estimated marginal likelihood $p(x)$ of held-out data. This is difficult to compute, and often one can only obtain a lower bound. It is further complicated when a reconstruction loss $R$ stands in for the log-likelihood of a complex data type. Balancing the KL term against $R$ is itself a challenge, and this issue seems built into the use of $p(x)$.
Theis et al., *A Note on the Evaluation of Generative Models*, discuss some of these difficulties.

However, we can easily measure the likelihood of the generative model sampling the encoding of a given data point $x$: it is just $p_z(E(x))$, where $E$ is the encoder and $p_z$ is the latent prior. This alone does not tell us whether the output is good, i.e. whether $\hat{x} = D(E(x))$ is close to $x$, where $D$ is the decoder. In other words, we might be able to sample the encoded latent form of $x$, but that does not mean the reconstruction will be good. But we can use $R$ for that! If there is a good chance of sampling $E(x)$ under the prior AND the reconstruction performance is good, then (seemingly) we can say the generative model is good.

So why not report two values over the held-out set: the average $R(x)$ and the average $\log p_z(E(x))$?
Are there any papers that do this?
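For concreteness, here is a minimal NumPy sketch of the two proposed quantities, assuming a standard normal prior $p_z = \mathcal{N}(0, I)$ and a squared-error reconstruction loss; the linear encoder/decoder and the toy data are placeholders for a trained VAE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a trained encoder/decoder (hypothetical; substitute
# your own networks).  Latent dimension d, data dimension n.
d, n = 8, 32
W_enc = rng.normal(size=(n, d)) / np.sqrt(n)
W_dec = W_enc.T  # tied weights, purely for illustration

def encode(x):
    return x @ W_enc

def decode(z):
    return z @ W_dec

def log_pz(z):
    """Log density of a standard normal prior N(0, I) at z."""
    k = z.shape[-1]
    return -0.5 * (np.sum(z**2, axis=-1) + k * np.log(2 * np.pi))

# Held-out data (a toy Gaussian sample in place of a real test set).
X = rng.normal(size=(100, n))

Z = encode(X)
X_hat = decode(Z)

avg_recon = np.mean(np.sum((X - X_hat) ** 2, axis=-1))  # average R(x)
avg_log_pz = np.mean(log_pz(Z))                         # average log p_z(E(x))

print(f"avg reconstruction error: {avg_recon:.3f}")
print(f"avg log p_z(E(x)):        {avg_log_pz:.3f}")
```

Both numbers are cheap to compute, since they need only one encoder/decoder pass per held-out point rather than an importance-sampled estimate of $p(x)$.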

Best Answer

Well, $p_z(E(x))$ does not seem to be a good metric, as it can be maximized by an encoder that disregards the standard normal prior and always outputs 0 (or values very close to 0, so that the decoder can still do its job).
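A quick numerical check makes this gaming concrete: the density of $\mathcal{N}(0, I)$ is maximized at $z = 0$, so an encoder that collapses every input to near 0 scores better on average $\log p_z(E(x))$ than one whose encodings actually match the prior (the two encoders below are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # latent dimension

def log_pz(z):
    """Log density of a standard normal prior N(0, I) at z."""
    return -0.5 * (np.sum(z**2, axis=-1) + d * np.log(2 * np.pi))

# Encodings from a "well-behaved" encoder: roughly prior-distributed.
z_honest = rng.normal(size=(1000, d))

# Encodings from a degenerate encoder that collapses everything near 0.
z_collapsed = 1e-3 * rng.normal(size=(1000, d))

print(np.mean(log_pz(z_honest)))     # ~ -d/2 * (1 + log(2*pi)), around -11.4
print(np.mean(log_pz(z_collapsed)))  # ~ -d/2 * log(2*pi), around -7.35 (near the maximum)
```

So the degenerate encoder wins on this metric by a wide margin, even though it has thrown away all information needed for sampling-quality to mean anything.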

More broadly, when comparing VAEs to other generative models (RBMs, PixelRNN, PixelCNN, GLOW, NICE, many GAN variants, etc.) that have no autoencoder structure, and hence no notion of a "reconstruction loss" or of the "likelihood of the encoding of a data point", neither measure makes sense. You can only really rely on (1) the subjective quality of the generated samples, and (2) the model's likelihood on held-out data (although GANs do not provide a likelihood at all, and for VAEs it can be hard to estimate).
