At least for (loopy) Belief Propagation (BP), the formula used to compute the partition function only holds at the convergence point. Note that in the following I'm talking about BP and the Bethe approximation. But similar things hold for Generalized Belief Propagation (GBP) and the Kikuchi approximation. Expectation Propagation is a special case of GBP (see one of the references).
Reasoning
Yedidia showed that the stationary points of the Bethe approximation to the free energy are exactly the fixed points of the BP algorithm.
The Bethe approximation is a variational approximation to the true free energy of the problem (the negative logarithm of the partition function).
I suppose you use exactly this correspondence to calculate your likelihood. The formula is a sum of three terms involving the entropies of the variables, the local entropies of the factor nodes, and the local expectations of the factor nodes. If so, it is unclear how the value you get by applying this formula to a non-converged BP run relates to the true likelihood.
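For concreteness, the three-term formula alluded to above is the Bethe free energy, which in the usual notation (this is my transcription, not a quote from the answer) reads, with factor beliefs $b_a$, variable beliefs $b_i$, factors $f_a$, and $d_i$ the degree of variable $i$:

$$F_{\text{Bethe}} = \sum_a \sum_{x_a} b_a(x_a) \ln \frac{b_a(x_a)}{f_a(x_a)} \;-\; \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i),$$

and at a BP fixed point $F_{\text{Bethe}}$ is the Bethe approximation to $-\ln Z$.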
The following sentence is taken from Yedidia et al.:
Indeed, the marginalization constraints are typically not satisfied at intermediate iterations of BP; it is only at a BP fixed point that the beliefs necessarily obey all the consistency constraints.
This basically means that in general the intermediate (non-converged) states of BP do not correspond to proper beliefs and are thus some kind of garbage. (Note added: those aren't proper beliefs even when converged, see next paragraph).
Addendum
I have a bit more experience now. Before convergence, some of the constraints of the variational problem are violated, so the current message values describe a "distribution" that is even less consistent than the usual constraints guarantee (remember that for EP, only the expected values have to match, which means we do not even need to be within the marginal polytope). This means the calculated log-likelihood is even less of a true lower bound on the true log-likelihood.
So you should really take the value at convergence.
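As a sanity check on the converged case, here is a minimal sketch (a made-up toy model, not from the answer) of sum-product BP on a two-variable tree. On a tree, one forward/backward sweep already reaches the fixed point, and plugging the resulting beliefs into the Bethe free energy recovers $-\ln Z$ exactly:

```python
import numpy as np

# Toy binary model: unary factors phi1, phi2 and one pairwise factor psi
# (all potential values are arbitrary illustrative choices)
phi1 = np.array([1.0, 2.0])
phi2 = np.array([3.0, 1.0])
psi = np.array([[2.0, 1.0],
                [1.0, 4.0]])  # psi[x1, x2]

# Exact partition function by brute-force enumeration
Z = sum(phi1[a] * phi2[b] * psi[a, b] for a in range(2) for b in range(2))

# Sum-product messages; on this tree one sweep is already the fixed point
m1_to_psi = phi1.copy()          # message x1 -> pairwise factor
m2_to_psi = phi2.copy()          # message x2 -> pairwise factor
mpsi_to_1 = psi @ m2_to_psi      # pairwise factor -> x1
mpsi_to_2 = psi.T @ m1_to_psi    # pairwise factor -> x2

# Normalized beliefs at variables and factor nodes
b1 = phi1 * mpsi_to_1; b1 /= b1.sum()
b2 = phi2 * mpsi_to_2; b2 /= b2.sum()
b_psi = psi * np.outer(m1_to_psi, m2_to_psi); b_psi /= b_psi.sum()
b_phi1 = phi1 * mpsi_to_1; b_phi1 /= b_phi1.sum()  # belief at unary factor
b_phi2 = phi2 * mpsi_to_2; b_phi2 /= b_phi2.sum()

# Bethe free energy: factor terms minus (d_i - 1) * variable entropy terms
# (each variable touches two factors here, so d_i = 2)
F = (np.sum(b_psi * np.log(b_psi / psi))
     + np.sum(b_phi1 * np.log(b_phi1 / phi1))
     + np.sum(b_phi2 * np.log(b_phi2 / phi2))
     - (2 - 1) * np.sum(b1 * np.log(b1))
     - (2 - 1) * np.sum(b2 * np.log(b2)))

print(F, -np.log(Z))  # on a tree these agree exactly
```

On a loopy graph the agreement becomes approximate even at convergence, which is exactly the point of the answer: away from convergence not even this approximate relationship is available.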
References
Probabilistic Graphical Models: Principles and Techniques by Koller and Friedman
The seminal papers by Yedidia, Freeman, and Weiss; "Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms" is a good overview
Welling, Minka, and Teh, "Structured Region Graphs: Morphing EP into GBP", 2012
Population:
Advantages:
- No need for sampling! (The entire population is in your dataset)
- Your findings would be representative of the population (since your analyses are based on the population).
Disadvantages:
- Collecting all of the information for a population would likely take a great deal of time, which means more effort and money.
Sample:
Advantages:
- Usually the only option - it would be a rare scenario to have data on an ENTIRE population.
- You can make reliable estimates of the population with less time, effort, and money. (If the sample is representative of the population of interest)
Disadvantages:
- None, as long as the sample is representative of the population of interest. Otherwise, bias aplenty! And bias means that you will have some explaining to do!
As you can see, the population's advantage is having all the information you could ever want; however, sampling is usually the reality as a function of time, effort, and, most importantly, money. By ensuring adequate sampling from the population of interest, you can extract a representative sample and, at a fraction of the cost, identify the same findings you would uncover by looking at the entire population.
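To make that concrete, here is a minimal synthetic illustration (all numbers invented for the example): a simple random sample of 1,000 units from a "population" of 100,000 recovers the population mean closely, at 1% of the data-collection cost.

```python
import random
import statistics

random.seed(0)  # for reproducibility

# Hypothetical finite population: 100,000 synthetic income values
population = [random.gauss(50_000, 12_000) for _ in range(100_000)]

# A full census gives the exact answer, at maximal cost
pop_mean = statistics.mean(population)

# A simple random sample is far cheaper and, because it is
# representative by construction, its mean is a reliable estimate
sample = random.sample(population, 1_000)
sample_mean = statistics.mean(sample)

print(pop_mean, sample_mean)  # the two should be close
```

The catch is in the sampling design: `random.sample` is representative by construction, whereas a convenience sample of, say, only high earners would reproduce the bias problem described above.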
On the plus side, there is an excellent introduction to the subject in MacKay's textbook Information Theory, Inference, and Learning Algorithms.