Solved – Expected value of logarithm of distribution

distributions, expected value, variational-bayes

I'm studying variational inference and I'm checking an example from Bishop, *Pattern Recognition and Machine Learning* (2006).

On page 470 (Section 10.1.3) there is an example with the univariate Gaussian.

There is a dataset $ \mathcal{D} = \{x_1, \ldots, x_N \} $ of observed values drawn independently from a Gaussian, so the likelihood is:

$$ p(\mathcal{D}\mid \mu,\tau)= \left( \frac{\tau}{2\pi}\right)^{N/2} \exp \left\{-\frac{\tau}{2} \sum_{n=1}^N (x_n - \mu)^2 \right\}$$

and the priors are:

$$ p(\mu\mid\tau) = \mathcal{N}\left(\mu \mid \mu_0, (\lambda_0\tau)^{-1}\right) $$

$$ p(\tau) = \operatorname{Gam}(\tau \mid a_0, b_0)$$

We try to approximate the posterior distribution using a factorized variational approximation $ q(\mu,\tau) = q_\mu(\mu)\,q_\tau(\tau)$.

The optimal factors can be obtained from the general result of VI (maximizing the lower bound), so for the log of the optimal factor we have

$$ \ln q_\mu^*(\mu) = \mathbb{E}_\tau[ \ln p(\mathcal{D}\mid \mu,\tau) + \ln p(\mu\mid\tau)] + \text{const} \tag{1}$$

$$ \ln q_\mu^*(\mu) = -\frac{\mathbb{E}[\tau]}{2} \left\{\lambda_0(\mu-\mu_0)^2 + \sum_{n=1}^N (x_n - \mu)^2\right\} + \text{const} \tag{2}$$

Then we can see that this is a Gaussian $ \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1}) $

with

$$ \mu_N= \frac{\lambda_0\mu_0+N\overline{x}}{\lambda_0 + N}$$

$$\lambda_N = (\lambda_0 + N) \E[\tau]$$

My question is: how do I go from $(1)$ to $(2)$? Should I just apply the definition of expected value? But what if it is not the distribution itself but the logarithm of the distribution?

Best Answer

Keep in mind that expectation is a linear operator. For a continuous variable $z$, $\mathbb{E}_z [f(z)] = \int f(z)\, p(z)\, dz$, so expectations distribute over sums and constants factor out: $\mathbb{E}_z [k f(z) + c g(z)] = k\, \mathbb{E}_z[f(z)] + c\, \mathbb{E}_z[g(z)]$ for constants $k$ and $c$. So we can take the two expectations in $(1)$ separately.
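In particular, anything that does not involve $\tau$ simply factors out of an expectation over $\tau$; for example, with the prior term above,

$$ \mathbb{E}_\tau\left[ \frac{\lambda_0 \tau}{2} (\mu - \mu_0)^2 \right] = \frac{\lambda_0 (\mu - \mu_0)^2}{2}\, \mathbb{E}_\tau[\tau], $$

since $(\mu - \mu_0)^2$ is a constant as far as $\tau$ is concerned.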

Let's look at the first one.

$$ \begin{split} \mathbb{E}_\tau[\ln p(\mathcal{D} \mid \mu, \tau)] &= \mathbb{E}_\tau\left[ \frac{N}{2}\ln\frac{\tau}{2\pi} - \frac{\tau}{2} \sum_{n=1}^N (x_n - \mu)^2 \right] \\ &= \frac{N}{2} \left( \mathbb{E}_\tau[\ln \tau] - \ln 2 \pi \right) - \frac{1}{2} \sum_{n=1}^N (x_n - \mu)^2 \, \mathbb{E}_\tau[\tau] \\ &= -\frac{\mathbb{E}_\tau[\tau]}{2} \sum_{n=1}^N (x_n-\mu)^2 + \text{const} \end{split} $$ where I just used linearity as above and dropped every term that is constant in $\mu$, since all you want at this stage is an unnormalized distribution over $\mu$. It should be easy for you to take the expectation of the second term, $\ln p(\mu \mid \tau)$.
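Spelling that second expectation out as well (only the $\mu$-dependent part survives; everything else is absorbed into the constant):

$$ \mathbb{E}_\tau[\ln p(\mu \mid \tau)] = \mathbb{E}_\tau\left[ \frac{1}{2}\ln\frac{\lambda_0\tau}{2\pi} - \frac{\lambda_0\tau}{2}(\mu-\mu_0)^2 \right] = -\frac{\mathbb{E}_\tau[\tau]}{2}\,\lambda_0(\mu-\mu_0)^2 + \text{const}. $$

Adding the two expectations reproduces $(2)$, and completing the square in $\mu$ then identifies the Gaussian with the $\mu_N$ and $\lambda_N$ quoted in the question.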

As for your second question: the general result of Variational Inference that you're using is (from Bishop) $$ \ln q_j^*({\bf Z}_j) = \mathbb{E}_{i \ne j} [\ln p({\bf X}, {\bf Z})] + \text{const.} $$ That is, if you partition the latent variables $\bf Z$ into $K$ components ${\bf Z}_i,\ i \in \{1, 2, \dots, K\}$, each with its own factor $q_i({\bf Z}_i)$, then for every $j$ the optimal factor $q_j^*({\bf Z}_j)$ is given by the expression above, where the expectation is taken with respect to all factors $i$ except $j$. This result is stated directly in terms of the log of the distributions, so you always want to work with logs when using it. If you have the regular distribution, take its log and then apply the VI result above.
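If it helps to see the two coupled updates run end to end, here is a minimal sketch of the coordinate-ascent iteration for this univariate example. Only the $q_\mu$ update appears in the excerpt above; the $q_\tau$ update (a $\operatorname{Gam}(\tau \mid a_N, b_N)$ factor) is obtained by applying the same general result with the roles of $\mu$ and $\tau$ swapped, so treat that part, along with the synthetic data and hyperparameter values, as illustrative assumptions rather than something quoted from the book.

```python
import numpy as np

# --- Hypothetical setup: synthetic data and prior hyperparameters chosen
# --- only to make the sketch runnable; they are not from Bishop's text.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)
N, xbar = x.size, x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0  # initial guess for E[tau]; any positive value will do

for _ in range(50):
    # q_mu(mu) = N(mu | mu_N, 1/lam_N) -- the update quoted in the question
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau

    # q_tau(tau) = Gam(tau | a_N, b_N): same general VI recipe applied to tau,
    # with expectations now taken over q_mu (assumed form, not in the excerpt)
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N   # E[mu], E[mu^2] under q_mu
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * (np.sum((x - E_mu)**2) + N * (E_mu2 - E_mu**2)
                      + lam0 * (E_mu2 - 2.0 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N  # Gamma mean under the rate parameterization

print(f"mu_N={mu_N:.3f}, lam_N={lam_N:.3f}, E[tau]={E_tau:.3f}")
```

The coupling is visible in the loop: $\lambda_N$ needs $\mathbb{E}[\tau]$, and $b_N$ needs the moments of $q_\mu$, which is why the scheme is iterated until the factors stop changing.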
