One way of deciding how to run variational MLE is to look at how the experts do it.
In Blei's LDA code (http://www.cs.princeton.edu/~blei/lda-c/lda-c-dist.tgz), within the "run_em" function, the "lda_inference" function (inside "doc_e_step") repeatedly maximizes with respect to each $q$ distribution until convergence. After the $q$'s converge, the algorithm maximizes with respect to the parameters in "lda_mle".
The justification for this order is that by maximizing with respect to the $q$'s until convergence you get a better estimate of the expectations of hidden variables (or marginalized parameters) needed to maximize with respect to the parameters.
In standard EM, of course, the expectations you are computing are exact - which is the main difference between standard and variational EM - so this is not a concern.
From the perspective of EM as a maximization algorithm over the function $F(q,\theta)$ (www.cs.toronto.edu/~radford/ftp/emk.pdf), or equivalently as maximization of the evidence lower bound, it is not clear that maximizing over the $q$'s until convergence is best in terms of computational efficiency: the algorithm will reach a local maximum no matter the order of the maximization steps.
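To make the trade-off concrete, here is a minimal Python sketch of both schedules. It is schematic, not Blei's actual C code; `init_q`, `update_q`, `update_params`, and `elbo` are hypothetical placeholders for the model-specific computations. Either setting performs coordinate ascent on the same objective $F(q,\theta)$, so both converge to a local maximum.

```python
def variational_em(data, theta, inner_iters=None, tol=1e-6, max_em_iters=100):
    """Schematic variational EM.

    inner_iters=None : run the E-step q-updates to convergence (lda-c style).
    inner_iters=k    : take only k coordinate-ascent steps on the q's before
                       updating theta; still ascends the same bound F(q, theta).
    """
    q = init_q(data)                                # hypothetical initializer
    old_bound = -float("inf")
    for _ in range(max_em_iters):
        # E-step: maximize F over q with theta held fixed.
        if inner_iters is None:
            prev = -float("inf")
            while True:
                q = update_q(data, q, theta)        # one pass over all q's
                bound = elbo(data, q, theta)
                if bound - prev < tol:              # inner loop converged
                    break
                prev = bound
        else:
            for _ in range(inner_iters):
                q = update_q(data, q, theta)
        # M-step: maximize F over theta with q held fixed.
        theta = update_params(data, q)
        bound = elbo(data, q, theta)
        if bound - old_bound < tol:                 # F never decreases
            break
        old_bound = bound
    return q, theta
```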
The problem of finding the hyperparameters is called evidence approximation. It is explained nicely in Bishop's book (page 166), and in great detail in this paper.
The idea is that your problem has the canonical form (the predictive distribution for a new sample),
$$
p(t|\mathbf{t}) = \int p(t|\mathbf{w},\beta)\, p(\mathbf{w}|\mathbf{t},\alpha,\beta)\, p(\alpha,\beta|\mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta
$$
where $\mathbf{t}$ is your training data, $\alpha,\beta$ are the hyperparameters (in Bishop's notation, the prior precision and the noise precision), and $\mathbf{w}$ are your weights.
First, computing this integral is expensive, maybe even intractable, and it carries an additional difficulty: the term $p(\alpha,\beta|\mathbf{t})$. This term tells us that we need to integrate over an ensemble of interpolators, one per hyperparameter setting. In practice this means that you would train each member of the ensemble, that is, evaluate each $p(\mathbf{t}|\alpha,\beta)$, and then, using Bayes' theorem,
$$
p(\alpha,\beta|\mathbf{t}) \propto p(\mathbf{t}|\alpha,\beta) p(\alpha,\beta)
$$
you could weight each member accordingly and finally sum (integrate) over all of them.
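To see what this ensemble sum looks like, here is a toy numerical sketch for Bayesian linear regression, where the log evidence $\ln p(\mathbf{t}|\alpha,\beta)$ has a closed form (Bishop, eq. 3.86). The grid, the flat prior, and the synthetic data are all assumptions made purely for illustration.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Closed-form log marginal likelihood ln p(t|alpha, beta) for
    Bayesian linear regression (Bishop, eq. 3.86)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi        # posterior precision
    m = beta * np.linalg.solve(A, Phi.T @ t)          # posterior mean of w
    E = 0.5 * beta * np.sum((t - Phi @ m) ** 2) + 0.5 * alpha * m @ m
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Toy "ensemble" over a hyperparameter grid, with a flat prior p(alpha, beta):
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

alphas, betas = np.logspace(-3, 2, 20), np.logspace(-1, 3, 20)
log_w = np.array([[log_evidence(Phi, t, a, b) for b in betas] for a in alphas])
w = np.exp(log_w - log_w.max())
w /= w.sum()                      # p(alpha, beta | t) evaluated on the grid
print(w.max())                    # most of the mass sits on a few grid points
```

The last line already hints at the peakedness assumption that the evidence framework makes next.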
The evidence framework assumes (the referred paper gives validity conditions for this assumption) that $p(\alpha,\beta|\mathbf{t})$ has a dominant peak at some values $\hat{\alpha},\hat{\beta}$. Under this assumption you replace the integral over the hyperparameters by a point estimate at the peak, namely,
$$
p(t|\mathbf{t}) \approx \int p(t|\mathbf{w},\hat{\beta})\, p(\mathbf{w}|\mathbf{t},\hat{\alpha},\hat{\beta})\, d\mathbf{w}
$$
If the prior is relatively flat, the problem of finding $\hat{\alpha}$ and $\hat{\beta}$ reduces to maximizing the marginal likelihood $p(\mathbf{t}|\alpha,\beta)$. In your case this marginal likelihood has a closed-form solution (it is also Gaussian).
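Bishop (section 3.5.2) gives a simple iterative re-estimation scheme for exactly this maximization. Below is a minimal sketch of it, reusing `numpy` and the `Phi`, `t` from the previous snippet; the starting values and iteration count are arbitrary.

```python
def maximize_evidence(Phi, t, alpha=1.0, beta=1.0, iters=100):
    """Re-estimate (alpha, beta) to maximize the evidence p(t|alpha, beta)
    (Bishop, eqs. 3.92 and 3.95)."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)          # eigenvalues of Phi^T Phi
    for _ in range(iters):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m = beta * np.linalg.solve(A, Phi.T @ t)
        # gamma = effective number of well-determined parameters
        gamma = np.sum(beta * eig / (alpha + beta * eig))
        alpha = gamma / (m @ m)
        beta = (N - gamma) / np.sum((t - Phi @ m) ** 2)
    return alpha, beta, m

alpha_hat, beta_hat, m_hat = maximize_evidence(Phi, t)
```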
P.S. In statistics this method is known as empirical Bayes. If you google for it, you will find plenty of references. I find this one really nice, since it works through easier problems in detail and carefully introduces all the necessary terms.
Best Answer
At least for (loopy) Belief Propagation (BP), the formula used to compute the partition function only holds at the convergence point. In the following I am talking about BP and the Bethe approximation, but similar statements hold for Generalized Belief Propagation (GBP) and the Kikuchi approximation; Expectation Propagation (EP) is a special case of GBP (see one of the references).
Reasoning
Yedidia et al. showed that the stationary points of the Bethe approximation to the free energy are exactly the fixed points of the BP algorithm.
The Bethe approximation is a variational approximation to the true free energy of the problem (negative logarithm of the partition function).
I suppose you use exactly this correspondence to calculate your likelihood. The formula is a sum of three terms involving the entropies of the variables, the local entropies of the factor nodes, and the local expectations of the factor nodes. If this is the case, then it is unclear how the value you get by applying this formula to a non-converged BP run relates to the true likelihood.
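For concreteness, that three-term expression is the Bethe free energy
$$
F_{\text{Bethe}} = \sum_a \sum_{\mathbf{x}_a} b_a(\mathbf{x}_a)\left[\ln b_a(\mathbf{x}_a) - \ln f_a(\mathbf{x}_a)\right] - \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i),
$$
with $\ln Z \approx -F_{\text{Bethe}}$. Here is a small Python sketch that evaluates it from beliefs, assuming dense arrays for the belief and factor tables (all names are illustrative):

```python
import numpy as np

def bethe_log_z(factor_beliefs, var_beliefs, factors, degrees):
    """Bethe approximation to ln Z, evaluated from BP beliefs.

    factor_beliefs : list of arrays b_a(x_a), one per factor
    var_beliefs    : list of arrays b_i(x_i), one per variable
    factors        : list of (positive) factor tables f_a(x_a)
    degrees        : list of ints d_i = number of factors touching variable i

    Only meaningful at a BP fixed point: before convergence the beliefs
    violate the consistency constraints of the variational problem.
    """
    eps = 1e-300                                   # guard against log(0)
    # local expectations and entropies of the factor nodes
    f_term = sum(np.sum(b * (np.log(b + eps) - np.log(f + eps)))
                 for b, f in zip(factor_beliefs, factors))
    # entropies of the variables, counted with multiplicity d_i - 1
    v_term = sum((d - 1) * np.sum(b * np.log(b + eps))
                 for b, d in zip(var_beliefs, degrees))
    return -(f_term - v_term)                      # ln Z ~= -F_Bethe
```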
As Yedidia et al. point out, in general the intermediate (non-converged) states of BP do not correspond to proper beliefs and are thus some kind of garbage. (Note added: those aren't proper beliefs even when converged; see the next paragraph.)
Addendum
I have a bit more experience now. Before convergence, some of the constraints of the variational problem are violated, so the current message values describe a "distribution" that is even less consistent than what the usual constraints guarantee (remember that for EP only the expected values have to match, which means we do not even need to be within the marginal polytope). This means the calculated log-likelihood is even less of a true lower bound on the true log-likelihood.
So you should really take the value at convergence.
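In code terms, this just means gating the free-energy evaluation behind a convergence check. A trivial sketch (reusing `numpy as np`); `bp_sweep` and `bethe_log_z_fn` are assumed callables, the latter forming beliefs from the final messages and applying the formula above:

```python
def run_bp_then_log_z(messages, bp_sweep, bethe_log_z_fn, tol=1e-8, max_sweeps=1000):
    """Only read off the Bethe ln Z once the messages have stopped changing."""
    for _ in range(max_sweeps):
        new = bp_sweep(messages)                   # one full update round
        change = max(np.max(np.abs(n - o)) for n, o in zip(new, messages))
        messages = new
        if change < tol:                           # fixed point reached
            return bethe_log_z_fn(messages)
    raise RuntimeError("BP did not converge; the Bethe ln Z is not meaningful")
```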
References
Koller, Daphne and Friedman, Nir, Probabilistic Graphical Models: Principles and Techniques
The seminal papers by Yedidia, Freeman and Weiss; "Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms" is a good overview
Welling, Max, Minka, Thomas P. and Teh, Yee Whye, "Structured Region Graphs: Morphing EP into GBP", 2012