MCMC/EM limitations? MCMC over EM

bayesian, expectation-maximization, markov-chain-montecarlo

I am currently learning hierarchical Bayesian models using JAGS from R, and also PyMC in Python ("Bayesian Methods for Hackers").

I got some intuition from this post: "you will end up with a pile of numbers that looks 'as if' you had somehow managed to take independent samples from the complicated distribution you wanted to know about." My understanding is that if I can specify the conditional probabilities, I can generate a memoryless (Markov) process from them; if I run the process long enough, its distribution converges to the joint distribution, and I can then keep a pile of numbers from the end of the generated sequence. It is as if I had taken independent samples from the complicated joint distribution; for example, I can make a histogram of them that approximates the density function.
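To make that intuition concrete, here is a minimal sketch (my own illustration, not from the post or the book) of a random-walk Metropolis sampler in Python, targeting a hypothetical bimodal density known only up to a normalising constant; a histogram of the retained draws approximates the target:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical target: an unnormalised bimodal density, known only up to a constant.
def log_target(x):
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

# Random-walk Metropolis: propose a nearby point, accept with the usual log-ratio rule.
n_iter, step = 50_000, 1.0
chain = np.empty(n_iter)
x = 0.0
for t in range(n_iter):
    prop = x + step * rng.normal()
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop
    chain[t] = x

# Discard an initial burn-in stretch, then treat the rest "as if" it were a sample.
samples = chain[5_000:]
plt.hist(samples, bins=80, density=True)
plt.title("Histogram of MCMC draws approximating the target")
plt.show()
```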

My question is: do I need to prove that MCMC converges for a given model? I ask because I previously learned the EM algorithm for GMM and LDA (graphical models). If I can just use an MCMC algorithm without proving that it converges, it would save much more time than EM, since with EM I have to compute the expected log-likelihood (which requires the posterior probabilities of the latent variables) and then maximize it, whereas with MCMC I only need to formulate the conditional probabilities. A sketch of those two EM steps follows below.
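For comparison, here is my own purely illustrative sketch of those two EM steps for a two-component 1-D Gaussian mixture: the E-step computes the posterior responsibilities of the latent labels, and the M-step maximises the expected complete-data log-likelihood in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Toy data from a hypothetical two-component mixture.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Initial guesses for weights, means, standard deviations.
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: posterior responsibilities p(z = k | x_i) under the current parameters.
    dens = w * norm.pdf(x[:, None], mu, sigma)          # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximise the expected complete-data log-likelihood in closed form.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(w, mu, sigma)   # estimated weights, means, standard deviations
```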

I am also wondering about conjugacy: if the likelihood function and the prior distribution are conjugate, does that mean the MCMC must converge? More generally, what are the limitations of MCMC and of EM?
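To fix ideas about conjugacy, here is a toy Gibbs sampler (my own sketch, with arbitrary prior choices) for Normal data with unknown mean and precision; because the priors are conditionally conjugate, both full conditionals have closed forms and are trivial to sample from:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=200)   # toy data
n = len(x)

# Conditionally conjugate priors (hypothetical choices):
# mu ~ Normal(0, 1/prior_prec), tau ~ Gamma(a0, rate=b0)
prior_prec, a0, b0 = 1e-3, 0.01, 0.01

n_iter = 10_000
mu, tau = 0.0, 1.0
draws = np.empty((n_iter, 2))
for t in range(n_iter):
    # mu | tau, x is Normal because the Normal prior is conjugate for the mean.
    prec = prior_prec + n * tau
    mean = n * tau * x.mean() / prec
    mu = rng.normal(mean, 1.0 / np.sqrt(prec))
    # tau | mu, x is Gamma because the Gamma prior is conjugate for the precision.
    a = a0 + 0.5 * n
    b = b0 + 0.5 * np.sum((x - mu) ** 2)
    tau = rng.gamma(a, 1.0 / b)      # NumPy's gamma takes a scale parameter
    draws[t] = mu, tau
```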

Best Answer

EM is an optimisation technique: given a likelihood with useful latent variables, it returns a local maximum, which may be a global maximum depending on the starting value.

MCMC is a simulation method: given a likelihood with or without latent variables, and a prior, it produces a sample that is approximately distributed from the posterior distribution. The first values of that sample usually depend on the starting value, which is why they are often discarded as a burn-in (or warm-up) stage.

When this sample is used to evaluate integrals associated with the posterior distribution [the overwhelming majority of the cases], the convergence properties are essentially the same as those of an iid Monte Carlo approximation, by virtue of the ergodic theorem.
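As a minimal illustration (a sketch with a placeholder array standing in for actual MCMC output), the posterior expectation of any function $h$ is approximated by the running ergodic average of $h$ along the chain, exactly as with iid Monte Carlo draws:

```python
import numpy as np

# Placeholder: "chain" is assumed to be a 1-D array of post-burn-in draws
# of a parameter of interest from some MCMC run.
chain = np.random.default_rng(3).normal(size=20_000)

# The ergodic theorem justifies approximating E[h(x) | data] by the running
# average of h over the chain, as in iid Monte Carlo.
h = chain ** 2                               # e.g. second moment of the parameter
running_mean = np.cumsum(h) / np.arange(1, len(h) + 1)
print(running_mean[-1])                      # Monte Carlo estimate of E[x^2 | data]
```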

If more is needed, i.e., a guarantee that $(x_t,\ldots,x_{t+T})$ is (approximately) a sample from the posterior $\pi(x|\mathfrak{D})$, some convergence assessment techniques are available, for instance in the R package CODA. Tools that ensure convergence theoretically, such as perfect sampling or renewal methods, are presumably beyond your reach.
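As an example of the kind of diagnostic such packages implement (a NumPy sketch of the Gelman-Rubin potential scale reduction factor, not CODA itself), one runs several chains from dispersed starting values and compares between-chain to within-chain variability; values near 1 are consistent with convergence but do not prove it:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin potential scale reduction factor (no chain splitting).

    `chains` is an (m, n) array: m chains of length n for one scalar parameter.
    Values close to 1 suggest the chains have mixed; this is a diagnostic,
    not a proof of convergence.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# Usage: run several chains from dispersed starting values and compare.
rng = np.random.default_rng(4)
chains = rng.normal(size=(4, 5_000))         # placeholder: 4 well-mixed chains
print(gelman_rubin(chains))                  # close to 1.0 here
```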
