MCMC/EM limitations? MCMC over EM

bayesian, expectation-maximization, markov-chain-montecarlo

I am currently learning hierarchical Bayesian models using JAGS from R, and also PyMC in Python ("Bayesian Methods for Hackers").

I got some intuition from this post: "you will end up with a pile of numbers that looks 'as if' you had somehow managed to take independent samples from the complicated distribution you wanted to know about." My understanding is that if I can specify the conditional probabilities, I can generate a memoryless (Markov) process from them; if I run the process long enough, its distribution converges to the joint distribution, and I can then keep a pile of numbers from the end of the generated sequence. It is as if I had taken independent samples from the complicated joint distribution; for example, I can make a histogram of them that approximates the density function.
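To make that intuition concrete, here is a minimal sketch (my own illustration, not from the post or the book) of a random-walk Metropolis sampler in Python, targeting a hypothetical bimodal density known only up to a normalising constant; a histogram of the retained draws approximates the target:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical target: an unnormalised bimodal density, known only up to a constant.
def log_target(x):
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

# Random-walk Metropolis: propose a nearby point, accept with the usual log-ratio rule.
n_iter, step = 50_000, 1.0
chain = np.empty(n_iter)
x = 0.0
for t in range(n_iter):
    prop = x + step * rng.normal()
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop
    chain[t] = x

# Discard an initial burn-in stretch, then treat the rest "as if" it were a sample.
samples = chain[5_000:]
plt.hist(samples, bins=80, density=True)
plt.title("Histogram of MCMC draws approximating the target")
plt.show()
```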

My question is: do I need to prove that MCMC converges for a given model? I ask because I previously learned the EM algorithm for GMM and LDA (graphical models). If I can just use an MCMC algorithm without proving that it converges, it would save much more time than EM, since with EM I have to compute the expected log-likelihood (which requires the posterior probabilities of the latent variables) and then maximize it, whereas with MCMC I only need to formulate the conditional probabilities. A sketch of those two EM steps follows below.
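For comparison, here is my own purely illustrative sketch of those two EM steps for a two-component 1-D Gaussian mixture: the E-step computes the posterior responsibilities of the latent labels, and the M-step maximises the expected complete-data log-likelihood in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Toy data from a hypothetical two-component mixture.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Initial guesses for weights, means, standard deviations.
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: posterior responsibilities p(z = k | x_i) under the current parameters.
    dens = w * norm.pdf(x[:, None], mu, sigma)          # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximise the expected complete-data log-likelihood in closed form.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(w, mu, sigma)   # estimated weights, means, standard deviations
```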

I am also wondering about conjugacy: if the likelihood function and the prior distribution are conjugate, does that mean the MCMC must converge? More generally, what are the limitations of MCMC and of EM?
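To fix ideas about conjugacy, here is a toy Gibbs sampler (my own sketch, with arbitrary prior choices) for Normal data with unknown mean and precision; because the priors are conditionally conjugate, both full conditionals have closed forms and are trivial to sample from:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=200)   # toy data
n = len(x)

# Conditionally conjugate priors (hypothetical choices):
# mu ~ Normal(0, 1/prior_prec), tau ~ Gamma(a0, rate=b0)
prior_prec, a0, b0 = 1e-3, 0.01, 0.01

n_iter = 10_000
mu, tau = 0.0, 1.0
draws = np.empty((n_iter, 2))
for t in range(n_iter):
    # mu | tau, x is Normal because the Normal prior is conjugate for the mean.
    prec = prior_prec + n * tau
    mean = n * tau * x.mean() / prec
    mu = rng.normal(mean, 1.0 / np.sqrt(prec))
    # tau | mu, x is Gamma because the Gamma prior is conjugate for the precision.
    a = a0 + 0.5 * n
    b = b0 + 0.5 * np.sum((x - mu) ** 2)
    tau = rng.gamma(a, 1.0 / b)      # NumPy's gamma takes a scale parameter
    draws[t] = mu, tau
```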

Best Answer

EM is an optimisation technique: given a likelihood with useful latent variables, it returns a local maximum, which may be a global maximum depending on the starting value.

MCMC is a simulation method: given a likelihood with or without latent variables, and a prior, it produces a sample that is approximately distributed from the posterior distribution. The first values of that sample usually depend on the starting value, which is why they are often discarded as a burn-in (or warm-up) stage.

When this sample is used to evaluate integrals associated with the posterior distribution [the overwhelming majority of the cases], the convergence properties are essentially the same as those of an iid Monte Carlo approximation, by virtue of the ergodic theorem.
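As a minimal illustration (a sketch with a placeholder array standing in for actual MCMC output), the posterior expectation of any function $h$ is approximated by the running ergodic average of $h$ along the chain, exactly as with iid Monte Carlo draws:

```python
import numpy as np

# Placeholder: "chain" is assumed to be a 1-D array of post-burn-in draws
# of a parameter of interest from some MCMC run.
chain = np.random.default_rng(3).normal(size=20_000)

# The ergodic theorem justifies approximating E[h(x) | data] by the running
# average of h over the chain, as in iid Monte Carlo.
h = chain ** 2                               # e.g. second moment of the parameter
running_mean = np.cumsum(h) / np.arange(1, len(h) + 1)
print(running_mean[-1])                      # Monte Carlo estimate of E[x^2 | data]
```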

If more is needed, i.e., a guarantee that $(x_t,\ldots,x_{t+T})$ is (approximately) a sample from the posterior $\pi(x|\mathfrak{D})$, some convergence assessment techniques are available, for instance in the R package CODA. Tools that ensure convergence theoretically, such as perfect sampling or renewal methods, are presumably beyond your reach.
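As an example of the kind of diagnostic such packages implement (a NumPy sketch of the Gelman-Rubin potential scale reduction factor, not CODA itself), one runs several chains from dispersed starting values and compares between-chain to within-chain variability; values near 1 are consistent with convergence but do not prove it:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin potential scale reduction factor (no chain splitting).

    `chains` is an (m, n) array: m chains of length n for one scalar parameter.
    Values close to 1 suggest the chains have mixed; this is a diagnostic,
    not a proof of convergence.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# Usage: run several chains from dispersed starting values and compare.
rng = np.random.default_rng(4)
chains = rng.normal(size=(4, 5_000))         # placeholder: 4 well-mixed chains
print(gelman_rubin(chains))                  # close to 1.0 here
```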
