Bayesian – Applying Variational Bayes on a Univariate Gaussian

Tags: bayesian, variational-inference, variational-bayes

I'm following an example from Murphy's book (Sec 21.5.1) on how to apply Variational Bayes to infer the posterior over the parameters for a 1D Gaussian $p(\mu,\lambda|\mathcal{D})$. The example uses a prior of the form
$$
p(\mu, \lambda)=p(\mu|\lambda)p(\lambda)=\mathcal{N}(\mu|\mu_0,(\kappa_0\lambda)^{-1})\mathcal{G}a(\lambda|a_0, b_0)
$$

and an approximate factored posterior of the form
$$
q(\mu, \lambda)=q_{\mu}(\mu)q_{\lambda}(\lambda)
$$

The unnormalized log posterior has the form
$$
\log[p(\mu,\lambda,\mathcal{D})]=\log[p(\mathcal{D}|\mu,\lambda)p(\mu|\lambda)p(\lambda)]
$$

Now what I don't understand is that in the following paragraph,

Updating $q_{\mu}(\mu)$:

The optimal form for $q_{\mu}(\mu)$ is obtained by averaging over
$\lambda$ : $$\begin{aligned} \log q_{\mu}(\mu)
&=\mathbb{E}_{q_{\lambda}}[\log p(\mathcal{D} \mid \mu, \lambda)+\log
p(\mu \mid \lambda)]+\text { const } \\
&=-\frac{\mathbb{E}_{q_{\lambda}}[\lambda]}{2}\left\{\kappa_{0}\left(\mu-\mu_{0}\right)^{2}+\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}\right\}+\text { const } \end{aligned}$$
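For concreteness, completing the square in $\mu$ (the step Murphy performs next, using the same symbols as above with $\bar{x}=\frac{1}{N}\sum_i x_i$) shows this is the log of a Gaussian:
$$
\log q_{\mu}(\mu)=-\frac{\mathbb{E}_{q_{\lambda}}[\lambda]\,(\kappa_0+N)}{2}\left(\mu-\frac{\kappa_0\mu_0+N\bar{x}}{\kappa_0+N}\right)^2+\text{const},
$$
so $q_{\mu}(\mu)=\mathcal{N}\!\left(\mu\mid m_N,(\kappa_N\,\mathbb{E}_{q_{\lambda}}[\lambda])^{-1}\right)$ with $m_N=\frac{\kappa_0\mu_0+N\bar{x}}{\kappa_0+N}$ and $\kappa_N=\kappa_0+N$.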

  1. Why don't we have $p(\lambda)$ inside the expectation? Does that mean $\mathbb{E}_{q_{\lambda}} [\log p(\lambda)]$ is a constant? Why?

  2. How is $\mathbb{E}_{q_{\lambda}} [\dots]$ different from $\mathbb{E}_{\lambda} [\dots]$ in general? That is, how does expectation with respect to a distribution ($q_{\lambda}$) differ from expectation with respect to a variable ($\lambda$)?

Best Answer

  1. This is a bit subtle. Note that you are interested in $\log q_{\mu}(\mu)$, so you can neglect any additive term that does not depend on $\mu$, even if it involves the $\lambda$ over which you are averaging. In particular, $\mathbb{E}_{q_{\lambda}}[\log p(\lambda)]$ is just a number once the expectation is taken, and it contains no $\mu$, so it is absorbed into the $+\text{const}$ term. After exponentiation, such additive constants only contribute a multiplicative normalization factor, which in this case you can easily recover at the end by normalizing $q_{\mu}$.

  2. The only difference is that in the first case ($\mathbb{E}_{q_{\lambda}}$) you state explicitly which distribution of $\lambda$ you are averaging over. You cannot integrate over $\lambda$ without specifying a distribution, but if the distribution is clear from context, then the shorthand $\mathbb{E}_{\lambda}$ is common. Be aware that many different notations are in circulation, which does not help when learning...
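To make point 1 concrete: $q_{\lambda}$ enters the $q_{\mu}$ update only through the single number $\mathbb{E}_{q_{\lambda}}[\lambda]=a_N/b_N$; $p(\lambda)$ never appears inside that expectation. Below is an illustrative sketch of the two coordinate-ascent updates for this model, using synthetic data and arbitrary hyperparameter choices (`mu0`, `kappa0`, `a0`, `b0` are my own, not the book's example):

```python
import numpy as np

# Synthetic 1D data (assumed: true mu = 2.0, true precision = 1/1.5^2)
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)
N, xbar = len(x), x.mean()

# Prior hyperparameters (illustrative choices, not tuned)
mu0, kappa0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_lambda = a0 / b0  # initial guess for E_{q_lambda}[lambda]
for _ in range(100):
    # Update q_mu(mu) = N(mu | m_N, (kappa_N * E[lambda])^{-1}):
    # only E[lambda] is needed from q_lambda; p(lambda) plays no role here.
    kappa_N = kappa0 + N
    m_N = (kappa0 * mu0 + N * xbar) / kappa_N
    var_mu = 1.0 / (kappa_N * E_lambda)

    # Update q_lambda(lambda) = Ga(lambda | a_N, b_N), averaging over q_mu
    # via E_{q_mu}[(mu - c)^2] = Var(mu) + (m_N - c)^2.
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (kappa0 * (var_mu + (m_N - mu0) ** 2)
                      + N * var_mu + np.sum((x - m_N) ** 2))
    E_lambda = a_N / b_N

print(f"E[mu] = {m_N:.3f}, E[lambda] = {E_lambda:.3f}")
```

After a few iterations the posterior mean of $\mu$ sits near the sample mean and $\mathbb{E}[\lambda]$ near the inverse sample variance, as expected for a weak prior.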
