Solved – How do Variational Autoencoders use Negative Log Likelihood/Cross entropy on real valued outputs

machine-learning, variational-bayes

When training a Variational Autoencoder, the objective being maximised is the variational lower bound (ELBO):

$$
\mathscr{L}(\boldsymbol{\theta}, \phi; \mathbf{x}^{(i)}) = -D_{KL}\left(q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})||p_{\boldsymbol{\theta}}(\mathbf{z})\right) + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_\boldsymbol{\theta}(\mathbf{x}^{(i)}|\mathbf{z})\right]
$$

I feel that I understand the practical calculations involved with the first term (KL divergence) reasonably well, since the original paper provides a small derivation (appendix B), which isn't too difficult to follow.

It's the second term that's giving me trouble. Firstly, it's not clear exactly what the subscript on the expectation is actually denoting in this case. Secondly, if our model is producing real values, then how do we determine $p_\boldsymbol{\theta}(\mathbf{x}^{(i)}|\mathbf{z})$ for even a single sample? Intuitively, the chance of predicting a specific sample is infinitesimally small, even if we output a probability distribution.

I've not been able to find any discussion (at least not one I could understand) of how this probability is determined from the model outputs. But if I look at some implementations (implementation 3, implementation 4), I see that it's just calculated as the Negative Log Likelihood/Cross Entropy, or MSE.

How can the same expectation, given the same type of outputs (real-valued, albeit not necessarily bounded the same way), have these different calculations? And how should I approach calculating the second term in general?

Additionally, $p_\boldsymbol{\theta}(\mathbf{x}^{(i)}|\mathbf{z})$ is sometimes referred to as a probability, and other times as a "model". A model is very different to a probability; how can I reconcile these two descriptions?

Best Answer

Firstly, it's not clear exactly what the subscript on the expectation is actually denoting in this case.

The subscript denotes the distribution that the expectation is taken with respect to: $\mathbf{z}$ is drawn from $q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})$. Dropping sub/superscripts, we can alternatively write the term as $\int q(z|x) \log p(x|z)\, dz$. In practice this integral is approximated by Monte Carlo sampling: draw one (or a few) $z$ from $q(z|x)$ and average $\log p(x|z)$ over the draws.
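As a minimal sketch of that single-sample estimate (using numpy, with a made-up datapoint and a toy linear map standing in for the decoder network; none of this comes from the paper or the linked implementations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one datapoint x^(i): the mean and
# log-variance of a diagonal-Gaussian q_phi(z | x^(i)).
mu_z = np.array([0.3, -0.1])
logvar_z = np.array([-1.0, -0.5])

# Draw a single z ~ q_phi(z | x^(i)) via the reparameterization trick.
eps = rng.standard_normal(mu_z.shape)
z = mu_z + np.exp(0.5 * logvar_z) * eps

# Toy "decoder": a linear map standing in for the network that outputs
# the mean of p_theta(x | z) = N(mu_x(z), I), with sigma fixed at 1.
W = rng.standard_normal((4, 2))
mu_x = W @ z

# The observed datapoint x^(i) (made up for illustration).
x = np.array([0.5, -0.2, 0.1, 0.9])

# One-sample Monte Carlo estimate of E_{q_phi(z|x)}[log p_theta(x|z)]:
# the Gaussian log-density of x under the decoder distribution.
log_p_x_given_z = -0.5 * np.sum((x - mu_x) ** 2) - 0.5 * x.size * np.log(2 * np.pi)
print(log_p_x_given_z)
```

Averaged over a minibatch, the negative of this quantity is what implementations report as the reconstruction loss (negative log-likelihood).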

Secondly, if our model is producing real values, then how do we determine $p(x|z)$ for even a single sample? Intuitively, the chance of predicting a specific sample is infinitesimally small, even if we output a probability distribution.

For continuous distributions, $p(x)$ is used to refer to the probability density at some point $x$. This is different from the probability that exactly $x$ will be drawn from that distribution (which is 0). See more info here.
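As a quick numerical illustration of that distinction (scipy used purely for demonstration):

```python
from scipy.stats import norm

# The density of a standard normal evaluated at x = 0.5 is a finite number...
print(norm.pdf(0.5))                  # ~0.352

# ...even though the probability of drawing exactly 0.5 is zero.
# Probabilities come from integrating the density over an interval:
print(norm.cdf(0.6) - norm.cdf(0.4))  # P(0.4 < X < 0.6), roughly 0.07
```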

I see that it's just calculated as the Negative Log Likelihood/Cross Entropy, or MSE.

In the real-valued case, $p_\theta(x|z)$ is typically $\mathcal{N}(\mu, \sigma^2I)$, where $\mu$ and $\sigma$ are deterministic functions of $z$ (computed via a neural network). In the discrete case, $p_\theta(x|z)$ is categorical, again with parameters determined by a neural network. It just happens that the negative log of a Gaussian density evaluated at $x$ works out to be the MSE (up to scale and an additive constant), and the negative log of the categorical distribution evaluated at $x$ works out to be the cross-entropy.
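To see why, write the Gaussian log-density out explicitly (assuming, for simplicity, a fixed $\sigma$ and a decoder mean $\mu(z)$; $D$ is the dimensionality of $x$):

$$
\log \mathcal{N}\left(x; \mu(z), \sigma^2 I\right) = -\frac{1}{2\sigma^2}\left\|x - \mu(z)\right\|^2 - \frac{D}{2}\log\left(2\pi\sigma^2\right)
$$

so maximising $\log p_\theta(x|z)$ is the same as minimising the MSE between $x$ and the decoder mean, up to the constant and the $1/(2\sigma^2)$ scale. Similarly, for a categorical decoder with probabilities $\pi(z)$ and a one-hot $x$, $\log p_\theta(x|z) = \sum_j x_j \log \pi_j(z)$, which is exactly the negative cross-entropy.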

A model is very different to a probability; how can I reconcile these two descriptions?

A model is a function which maps parameters $\theta$ to some probability measure over the sample space. You can think of $p_\theta$ as an instantiation of the model with parameters $\theta$, and $p_\theta(x|z)$ as the probability (or probability density) that this instantiated model assigns to some event(s) in the sample space, conditioned on $z$.
