Variational Bayes – Why Learn $\log{\sigma^2}$ in VAE Reparameterization Trick Instead of Standard Deviation?

generative-models, variational-bayes

We know that the reparameterization trick is to learn two vectors $\mu$ and $\sigma$, sample $\epsilon$ from $N(0, 1)$, and then form the latent vector $Z$ as (where $\odot$ is the element-wise product):
$$
Z = \mu + \sigma\odot\epsilon
$$
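For concreteness, here is a minimal sketch of that version of the trick in TensorFlow; the values of `mu` and `sigma` are only illustrative placeholders for what an encoder would output.

    import tensorflow as tf

    # Placeholder encoder outputs (illustrative values only).
    mu = tf.constant([[0.0, 1.0]])
    sigma = tf.constant([[1.0, 0.5]])

    # Sample eps ~ N(0, I) and form Z = mu + sigma * eps (element-wise product).
    eps = tf.random.normal(shape=mu.shape)
    z = mu + sigma * eps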

However, the TensorFlow tutorial code for VAEs learns $\log{\sigma^2}$ and then transforms it into $\sigma$ via tf.exp. The code is here:

  def reparameterize(self, mean, logvar):
    eps = tf.random.normal(shape=mean.shape)   # eps ~ N(0, I)
    return eps * tf.exp(logvar * .5) + mean    # mu + sigma * eps, since sigma = exp(0.5 * log(sigma^2))

which corresponds to:
$$
Z = \mu + \epsilon\times e^{0.5\times\log{\sigma^2}} = \mu + \epsilon\times e^{\log{\sqrt{\sigma^2}}} = \mu + \epsilon\times \sigma
$$

I know that we learn $\log{\sigma^2}$ instead of $\sigma^2$ because the variance of a random variable is constrained to be positive (i.e. $\sigma^2 \in \mathbb{R}^+$), so if we were to learn the variance directly we would somehow have to constrain the output of a neural network to be positive. A simple way around this is to learn the logarithm instead, since $\log(\sigma^2)$ can take any value in $\mathbb{R}$ while $\exp(\log(\sigma^2)) \in \mathbb{R}^+$ (thanks to this answer).
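As a sketch of that difference, assuming a hypothetical encoder with a `latent_dim`-dimensional output: a log-variance head can be left unconstrained, whereas a head that predicts the variance directly needs a positivity constraint such as a softplus activation.

    import tensorflow as tf

    latent_dim = 2  # hypothetical latent dimensionality

    # Option A (as in the tutorial): predict log(sigma^2) with an unconstrained
    # linear layer; positivity comes for free when we later take exp(.),
    # e.g. sigma = tf.exp(0.5 * logvar_head(h)).
    logvar_head = tf.keras.layers.Dense(latent_dim)

    # Option B: predict sigma^2 directly; the layer itself must be constrained
    # to produce positive values, e.g. via a softplus activation,
    # and then sigma = tf.sqrt(var_head(h)).
    var_head = tf.keras.layers.Dense(latent_dim, activation="softplus")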

BUT I don't understand why we don't just learn $\log{\sigma}$. Why do we learn the variance instead of the standard deviation?

Best Answer

  1. It doesn't make any real difference: since $\log \sigma^2 = 2\log \sigma$, learning one is as easy as learning the other (see the sketch below).
  2. It's traditional in statistics to think of $\sigma^2$ as the second parameter of a Normal distribution (rather than $\sigma$).
  3. There's a simple unbiased estimator for $\sigma^2$ but not for $\sigma$.
  4. The math for representing the Normal as a two-parameter exponential family is slightly simpler with $(\mu, \sigma^2)$ than with $(\mu, \sigma)$.

None of this is really important; it's mostly tradition.
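To make point 1 concrete, a small sketch (with an arbitrary example tensor) showing that a network output interpreted as $\log\sigma^2$ and one interpreted as $\log\sigma$ recover the same $\sigma$, differing only by a factor of 2 inside the exponent:

    import tensorflow as tf

    logvar = tf.constant([[-1.2, 0.4]])  # output interpreted as log(sigma^2)
    logstd = 0.5 * logvar                # the same quantity interpreted as log(sigma)

    sigma_from_logvar = tf.exp(0.5 * logvar)
    sigma_from_logstd = tf.exp(logstd)
    # Both recover the same sigma, so learning one is as easy as learning the other.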
