Neural Networks – How to Derive MSE in Loss Function for Variational Autoencoder

autoencoders, loss-functions, neural-networks, variational-bayes

Context: https://arxiv.org/pdf/1312.6114.pdf

Starting from this equation in the paper:
$$
\mathcal{L}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right) \simeq \frac{1}{2} \sum_{j=1}^{J}\left(1+\log \left(\left(\sigma_{j}^{(i)}\right)^{2}\right)-\left(\mu_{j}^{(i)}\right)^{2}-\left(\sigma_{j}^{(i)}\right)^{2}\right)+\frac{1}{L} \sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i, l)}\right)
$$

where $\mathbf{z}^{(i, l)}=\boldsymbol{\mu}^{(i)}+\boldsymbol{\sigma}^{(i)} \odot \boldsymbol{\epsilon}^{(l)} \quad$ and $\quad \boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(0, \mathbf{I})$
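For concreteness, here is a minimal numpy sketch of that reparameterization step (the latent size and the encoder outputs below are made-up stand-ins, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
J = 16                                 # latent dimensionality (arbitrary choice)

# Stand-ins for the encoder outputs mu^(i) and sigma^(i) for one data point
mu = rng.normal(size=J)
sigma = rng.uniform(0.1, 1.0, size=J)

# z^(i,l) = mu^(i) + sigma^(i) (*) eps^(l),  with eps^(l) ~ N(0, I)
eps = rng.standard_normal(J)
z = mu + sigma * eps
```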

So I'm looking at this part of the equation in particular:
$$\frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i, l)}\right)
$$

I'm considering the case where the decoder is a multivariate Gaussian with a diagonal covariance structure:
$$
\log p(\mathbf{x} \mid \mathbf{z}) = \log \mathcal{N}\left(\mathbf{x} ; \boldsymbol{\mu}, \boldsymbol{\sigma}^{2} \mathbf{I}\right)
$$
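Putting those pieces together, here is a hedged numpy sketch of the Monte Carlo estimate $\frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i, l)}\right)$ with a diagonal-Gaussian decoder; the linear map `W` standing in for the decoder network and the fixed decoder variance are assumptions made purely for illustration:

```python
import numpy as np

def gaussian_logpdf_diag(x, mean, var):
    """log N(x; mean, diag(var)) for a single data point."""
    return (-0.5 * x.size * np.log(2 * np.pi)
            - 0.5 * np.sum(np.log(var))
            - 0.5 * np.sum((x - mean) ** 2 / var))

rng = np.random.default_rng(0)
D, J, L = 10, 4, 5                      # data dim, latent dim, number of MC samples

x = rng.random(D)                       # one data point x^(i)
mu = rng.normal(size=J)                 # encoder mean (stand-in)
sigma = rng.uniform(0.1, 1.0, size=J)   # encoder std dev (stand-in)
W = rng.normal(size=(D, J))             # toy linear "decoder" (assumption)
decoder_var = np.full(D, 0.1)           # fixed decoder variance (assumption)

log_liks = []
for _ in range(L):
    z = mu + sigma * rng.standard_normal(J)   # reparameterized sample z^(i,l)
    x_mean = W @ z                            # decoder mean for this sample
    log_liks.append(gaussian_logpdf_diag(x, x_mean, decoder_var))

reconstruction_term = np.mean(log_liks)       # (1/L) * sum_l log p(x^(i) | z^(i,l))
```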

In this stackexchange discussion, https://ai.stackexchange.com/questions/27341/in-variational-autoencoders-why-do-people-use-mse-for-the-loss, a few answers and comments discuss how manipulating this particular $\log p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ yields something that resembles the MSE.

So, if you are trying to predict e.g. floating-point numbers (in the case of images, these can be the RGB values in the range $[0,1]$) and you assume $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ is a Gaussian, then you can equivalently minimise the MSE between the prediction of the decoder and the real image in order to maximise the likelihood. You can easily show this: just replace $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ with the Gaussian pdf, then maximise that with respect to the parameters, and you should end up with something that resembles the MSE.

And similarly, in the discussion Loss function autoencoder vs variational-autoencoder or MSE-loss vs binary-cross-entropy-loss, it is mentioned that

If you assume it follows a normal distribution, you will end up with an MSE minimization, since $p(x \mid z)$ can be reformulated as $p(x \mid \hat{x}) \sim \mathcal{N}(\hat{x}, \sigma)$.

My question is, how do we show that this is true?

This is my attempt:

Let $\mathbf{C} = \boldsymbol{\sigma}^{2}\mathbf{I}$ be the covariance matrix, and let $\sigma_i^2$ be the $i$th diagonal entry of this matrix.

So then
$$
\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\mathbf{C}) = \frac{1}{(2\pi)^{L/2}\sqrt{\det(\mathbf{C})}}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\mathbf{C}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
$$

$$\det(\mathbf{C}) = \prod_i \sigma_i^2$$

$$\log \mathcal{N}(\mathbf{x};\boldsymbol{\mu},\mathbf{C}) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log\left(\prod_i \sigma_i^2\right) - \sum_i \frac{(x_i-\mu_i)^2}{2\sigma_i^2}$$

$$= -\frac{L}{2}\log(2\pi) - \frac{1}{2}\sum_i \log\sigma_i^2 - \sum_i \frac{(x_i-\mu_i)^2}{2\sigma_i^2}
$$
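As a numerical sanity check of this expansion, it agrees with `scipy.stats.multivariate_normal` when the covariance is diagonal; the dimension and the values below are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
L = 4                                       # dimension (arbitrary small example)
x = rng.normal(size=L)
mu = rng.normal(size=L)
sigma2 = rng.uniform(0.5, 2.0, size=L)      # diagonal entries of C

# Expanded diagonal-Gaussian log-density from the derivation above
manual = (-L / 2 * np.log(2 * np.pi)
          - 0.5 * np.sum(np.log(sigma2))
          - np.sum((x - mu) ** 2 / (2 * sigma2)))

# Reference value computed with C = diag(sigma2)
reference = multivariate_normal(mean=mu, cov=np.diag(sigma2)).logpdf(x)

print(np.isclose(manual, reference))        # True
```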

And from this point, I am not sure what to do next.

Best Answer

Starting from $$ -\frac{L}{2}\log(2\pi) - \frac{1}{2}\sum_i \log\sigma_i^2 - \sum_i \frac{(x_i-\mu_i)^2}{2\sigma_i^2} $$

$L$, the dimension, is a fixed quantity. $\sigma$ is also a fixed quantity, with $\sigma_i = \sigma_j$ for all $i,j$ (you can choose to treat it as a variable, but this is rarely done, and doesn't lead to MSE). So we can drop the first two terms, since they are constant, and we're left with

$$ -\frac{1}{2\sigma^2}\sum_i{ (x_i-\mu_i)^2} $$

Maximizing this quantity is therefore equivalent to minimizing $\sum_i (x_i-\mu_i)^2$, which is the MSE up to a constant factor of $1/L$.
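A small numerical illustration of that equivalence (fixed $\sigma$ and made-up values): the negative log-likelihood and the MSE differ only by constants that do not depend on the prediction, so they share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(8)            # target values, e.g. pixel intensities in [0, 1]
x_hat = rng.random(8)        # decoder prediction (plays the role of mu)
sigma2 = 0.5                 # fixed decoder variance (assumed constant)
dim = x.size

# Negative Gaussian log-likelihood with covariance sigma2 * I
neg_log_lik = (dim / 2 * np.log(2 * np.pi * sigma2)
               + np.sum((x - x_hat) ** 2) / (2 * sigma2))
mse = np.mean((x - x_hat) ** 2)

# Undo the additive and multiplicative constants: what remains is exactly the MSE
recovered_mse = (neg_log_lik - dim / 2 * np.log(2 * np.pi * sigma2)) * 2 * sigma2 / dim
print(np.isclose(recovered_mse, mse))   # True
```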