Solved – Combination of variational methods and empirical Bayes

bayesian, expectation-maximization, graphical-model

Suppose I have a posterior $p(z, \theta | y, \eta)$, where $y$ is observed data, $z$ are hidden variables, $\theta$ are parameters, and $\eta$ is a vector of hyperparameters. I construct a mean-field approximation to the posterior using coordinate ascent, i.e.
$$
q(z) \gets \exp\left\{E_{q(\theta)}\left[\log p(z, \theta | y, \eta)\right] + \mbox{const}\right\} \\
q(\theta) \gets \exp\left\{E_{q(z)}\left[\log p(z, \theta | y, \eta)\right] + \mbox{const}\right\} $$
where $\mbox{const}$ makes each integrate to $1$. Iterate until convergence. My question is, if I want to do empirical Bayes, and (approximately) profile out $\eta$, is it valid to just augment this with an EM step
$$
\eta \gets \arg \max_\eta E_q\left[\log p(y, z, \theta | \eta)\right]?
$$
Based on the evidence lower bound
$$
\log p(y | \eta) \ge E_q\left[\log p(y, z, \theta | \eta)\right] - E_q\left[\log q(z, \theta)\right]
$$
it seems like I should be able to get away with this; since $E_q\left[\log q(z, \theta)\right]$ does not depend on $\eta$, optimizing the lower bound over $\eta$ is exactly the proposed EM step, so I'm still doing coordinate ascent on the lower bound.
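
For concreteness, here is a minimal sketch of the scheme on a toy Gaussian hierarchy (my own illustration; the model, variable names, and constants are assumptions I'm making just to have something runnable): $\theta \sim N(0, \eta)$, $z_i \sim N(\theta, \sigma_z^2)$, $y_i \sim N(z_i, \sigma_y^2)$, with Gaussian mean-field factors $q(\theta)$ and $q(z_i)$. Each coordinate update has a closed form, the EM step for the hyperparameter works out to $\eta \gets E_q[\theta^2]$, and the printed bound should be non-decreasing if I've done the algebra right.

```python
import numpy as np

# Toy hierarchy (assumed purely for illustration):
#   theta ~ N(0, tau2)      tau2 = eta, the hyperparameter
#   z_i   ~ N(theta, sz2)   hidden variables
#   y_i   ~ N(z_i, sy2)     observed data
# Mean-field factors: q(theta) = N(m_th, s_th2), q(z_i) = N(m_z[i], s_z2_q)

rng = np.random.default_rng(0)
n, sz2, sy2 = 50, 1.0, 0.5
y = rng.normal(2.0, np.sqrt(sz2 + sy2), size=n)   # synthetic data

tau2 = 1.0                      # hyperparameter eta, set by empirical Bayes below
m_z, m_th, s_th2 = np.zeros(n), 0.0, 1.0

def elbo():
    """Evidence lower bound E_q[log p(y, z, theta | tau2)] - E_q[log q]."""
    s_z2_q = 1.0 / (1.0 / sy2 + 1.0 / sz2)        # q(z_i) variance (fixed point)
    e_log_p = (-0.5 * np.sum((y - m_z) ** 2 + s_z2_q) / sy2
               - 0.5 * n * np.log(2 * np.pi * sy2)
               - 0.5 * np.sum((m_z - m_th) ** 2 + s_z2_q + s_th2) / sz2
               - 0.5 * n * np.log(2 * np.pi * sz2)
               - 0.5 * (m_th ** 2 + s_th2) / tau2
               - 0.5 * np.log(2 * np.pi * tau2))
    entropy = (0.5 * n * np.log(2 * np.pi * np.e * s_z2_q)
               + 0.5 * np.log(2 * np.pi * np.e * s_th2))
    return e_log_p + entropy

for t in range(100):
    # CAVI update for q(z_i): precision-weighted average of y_i and E_q[theta]
    s_z2_q = 1.0 / (1.0 / sy2 + 1.0 / sz2)
    m_z = s_z2_q * (y / sy2 + m_th / sz2)
    # CAVI update for q(theta)
    s_th2 = 1.0 / (n / sz2 + 1.0 / tau2)
    m_th = s_th2 * np.sum(m_z) / sz2
    # Empirical-Bayes EM step for the hyperparameter:
    # argmax_tau2 E_q[log p(y, z, theta | tau2)] = E_q[theta^2]
    tau2 = m_th ** 2 + s_th2
    if t % 10 == 0:
        print(f"iter {t:3d}  ELBO {elbo():.4f}  tau2 {tau2:.4f}")
```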

I suspect that, if this works, it is the obvious thing to do for people who do variational inference regularly. Since variational methods are fast, my thought is that maybe I could do this to set hyperparameters before launching into a Gibbs sampler for exact (up to MCMC error) inference.

Update: I've toyed around with this a bit and found that adding the EM step can work okay, but in some situations it seems to increase sensitivity to bad local optima in the variational algorithm: initializations of the variational parameters that normally work okay no longer do. If I instead only throw in the EM step every few iterations it works better, but then of course convergence slows down. One approach that works okay is to "burn in" the variational algorithm using a method that I trust, and then add the EM step at every iteration thereafter. I'm adding a bounty because I'm still looking for some good general advice.
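
To pin down the schedules I'm talking about, here is one way I could encode them (the helper name and default values are mine, purely illustrative): CAVI-only sweeps for a burn-in period, then the hyperparameter EM step only on every few sweeps afterwards.

```python
def do_em_step(t, burn_in=25, every=1):
    """Return True if sweep t should include the hyperparameter EM step.

    burn_in: number of initial CAVI-only sweeps (the "burn in with a
             trusted method" idea above).
    every:   apply the EM step on every `every`-th sweep thereafter.
    """
    return t >= burn_in and (t - burn_in) % every == 0

# Which sweeps get an EM step under burn_in=25, every=5:
print([t for t in range(40) if do_em_step(t, burn_in=25, every=5)])

# In the sketch above, the unconditional tau2 update would become
#     if do_em_step(t, burn_in=25, every=5):
#         tau2 = m_th ** 2 + s_th2
```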

Best Answer

One way of deciding how to run variational MLE is to look at how the experts do it.

In Blei's LDA code (http://www.cs.princeton.edu/~blei/lda-c/lda-c-dist.tgz), within the "run_em" function, the "lda_inference" function (inside "doc_e_step") repeatedly maximizes with respect to each $q$ distribution until convergence. After the $q$'s converge, the algorithm maximizes with respect to the parameters in "lda_mle".

The justification for this order is that by maximizing with respect to the $q$'s until convergence you get a better estimate of the expectations of hidden variables (or marginalized parameters) needed to maximize with respect to the parameters.

In standard EM, of course, the expectations you are computing are exact - which is the main difference between standard and variational EM - so this is not a concern.

From the perspective of EM as a maximization algorithm over the function $F(q,\theta)$ (www.cs.toronto.edu/~radford/ftp/emk.pdf), or from the perspective of maximizing the evidence lower bound, it is not clear that maximizing over the $q$'s until convergence is best in terms of computational efficiency, because the algorithm will reach a local maximum regardless of the order of the maximization steps.
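
To make the two orderings concrete, here is a small sketch on a toy Gaussian hierarchy (my own example for illustration, not the LDA model or lda-c itself): a large `inner_sweeps` gives the lda-c style schedule of (approximately) converging the $q$'s before each hyperparameter update, while `inner_sweeps=1` interleaves one coordinate sweep per update. Both are coordinate ascent on the same bound and should land at essentially the same fixed point here.

```python
import numpy as np

def fit(y, sz2=1.0, sy2=0.5, inner_sweeps=1, outer_iters=50):
    """Variational EM for the toy hierarchy theta ~ N(0, tau2),
    z_i ~ N(theta, sz2), y_i ~ N(z_i, sy2) (my own example, not lda-c).

    inner_sweeps=1   -> interleaved: one CAVI sweep per hyperparameter update.
    inner_sweeps>>1  -> nested, lda-c style: (approximately) converge the
                        q's before each update of the hyperparameter tau2.
    """
    n = len(y)
    tau2, m_th, s_th2 = 1.0, 0.0, 1.0
    m_z = np.zeros(n)
    for _ in range(outer_iters):
        for _ in range(inner_sweeps):            # E-like step(s): update the q's
            s_z2_q = 1.0 / (1.0 / sy2 + 1.0 / sz2)
            m_z = s_z2_q * (y / sy2 + m_th / sz2)
            s_th2 = 1.0 / (n / sz2 + 1.0 / tau2)
            m_th = s_th2 * np.sum(m_z) / sz2
        tau2 = m_th ** 2 + s_th2                 # M-like step for the hyperparameter
    return tau2

y = np.random.default_rng(0).normal(2.0, 1.0, size=50)
print(fit(y, inner_sweeps=1))    # interleaved schedule
print(fit(y, inner_sweeps=50))   # nested, converge-the-q's-first schedule
```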
