Solved – burn-in for Metropolis-Hastings MCMC

bayesian, markov-chain-montecarlo, metropolis-hastings, posterior, sampling

I was wondering if there is a principled way to figure out how many samples to discard during the burn-in stage of MH-MCMC. As I understand it, the initial samples can bias the computation of expectations under the estimated posterior distribution, especially if they are really far from the mean.

Say I do a MAP estimation and try to find some local mode of the distribution to initialise my MH-MCMC. Would that mean I do not have to do the burn-in (assuming this mode lies in a high-density region of the distribution)?

Best Answer

…have to do the burn-in…?

As pointed out in the comments, discarding some of the initial samples is not a mathematical necessity. Burn-in is a computational-statistics hack/shortcut: throwing away the early draws spares us from having to take many more samples to wash out the bias those early draws introduce. When the total number of samples is small, that bias can be quite noticeable; with many samples it is minimal. But if you do not want to take a lot of samples and your sampler takes time to converge*, then it is a good idea to throw some initial samples out.
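For concreteness, here is a minimal random-walk Metropolis sketch (the function names and the toy standard-Normal target are my own illustration, not from the question) showing what "throwing the initial samples out" looks like in code: the chain is deliberately started far from the mode, and the first `burn_in` draws are simply dropped before any expectations are computed.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, burn_in, step=1.0, seed=0):
    """Random-walk Metropolis for a 1-D target; the first `burn_in` draws are discarded."""
    rng = np.random.default_rng(seed)
    x, logp = x0, log_target(x0)
    draws = []
    for _ in range(n_samples + burn_in):
        prop = x + step * rng.normal()                 # symmetric proposal
        logp_prop = log_target(prop)
        if np.log(rng.uniform()) < logp_prop - logp:   # MH accept/reject step
            x, logp = prop, logp_prop
        draws.append(x)
    return np.array(draws[burn_in:])                   # drop the burn-in segment

# Toy example: standard Normal target, deliberately bad starting point.
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=50.0,
                              n_samples=20_000, burn_in=2_000)
print(samples.mean(), samples.std())                   # should be near 0 and 1
```

With enough post-burn-in draws, the same estimates would come out fine even without discarding anything, which is exactly the trade-off described above.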

…a MAP estimation and try to find some local mode of the distribution to initialise my MH-MCMC

This is perfectly OK, and it has been used/recommended in cases where the model is complex enough that convergence may take an obnoxious amount of time to reach. For example, consider a finite mixture of $K$ Normals with latent mixture labels: $$ \begin{aligned} y_i \mid z_i = k &\sim \mathcal{N}(\mu_k, \sigma^2_k) \\ z_i \mid \boldsymbol{\pi} &\sim \text{Cat}_K(\boldsymbol{\pi}) \\ \boldsymbol{\pi} &\sim \text{Dir}_K(\dots) \\ \mu_k, \sigma^2_k &\sim \mathcal{N}\text{-}\Gamma^{-1}(\dots) \end{aligned} $$

Let's pretend that a sampler based on this model would take a long time to converge, and let's also ignore the model's identifiability issues. One possible way to initialize the parameters would be to identify 'humps' in the histogram of your data and place appropriately scaled Normal distributions at those locations; that is, the initialization would be pseudo-empirical. You might then take each datum $y_i$ and assign its corresponding $z_i$ to the label whose Normal parameters give $y_i$ the highest likelihood, and finally initialize $\boldsymbol{\pi}$ as the vector of proportions of the $z_i$ belonging to each of the $K$ labels.
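A rough sketch of that recipe (the quantile-based 'hump' finding is my own stand-in for eyeballing peaks on a histogram; the label assignment and proportion steps follow the description above):

```python
import numpy as np

def empirical_init(y, K):
    """Pseudo-empirical initialization for a K-component Normal mixture."""
    # Crude 'hump' detection: split the data into K quantile bins and use
    # each bin's mean and standard deviation as that component's parameters.
    edges = np.quantile(y, np.linspace(0.0, 1.0, K + 1))
    bins = np.digitize(y, edges[1:-1])
    mu = np.array([y[bins == k].mean() for k in range(K)])
    sigma = np.array([y[bins == k].std(ddof=1) for k in range(K)])

    # Assign each y_i to the component under which it has the highest
    # Normal (log-)likelihood.
    logdens = -0.5 * ((y[:, None] - mu) / sigma) ** 2 - np.log(sigma)
    z = logdens.argmax(axis=1)

    # Initialize the mixture weights as the label proportions.
    pi = np.bincount(z, minlength=K) / len(y)
    return mu, sigma, z, pi

# Toy data with two well-separated humps.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(4.0, 0.5, 500)])
mu, sigma, z, pi = empirical_init(y, K=2)
print(mu, sigma, pi)
```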

If you have a good eye, a sampler started from these initializations will not budge far from where it began. The trace plots would give no reason to think convergence was not immediate, so there would be almost no benefit in discarding burn-in.

We can do this because there are no restrictions on initializations in MCMC. The initial values do not affect the stationary distribution, so there is no 'crime' in using the data to inform the initializations, as there arguably is when the data inform, say, the prior (as in empirical Bayes).

That said, I have only once been in a situation (and I would hope it is rare) where implementing a MAP-estimation algorithm was noticeably faster, in either run time or developer time, than simply taking some extra samples. I suppose it is context-specific, data-specific, MAP-algorithm-specific, and sampler-specific.


*Of course it is always worth asking in these cases: did my sampler actually converge?
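One common way to answer that question is to run several chains from over-dispersed starting points and compare within-chain to between-chain variability, e.g. with the Gelman-Rubin statistic. A minimal sketch for a scalar parameter, assuming the chains are stacked into a NumPy array (for instance via the `metropolis_hastings` sketch above):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for an (m_chains, n_draws) array of one scalar
    parameter; values close to 1 are consistent with convergence."""
    chains = np.asarray(chains)
    n = chains.shape[1]
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

# Usage (hypothetical): three chains from dispersed starting points.
# chains = np.stack([metropolis_hastings(lambda x: -0.5 * x**2, x0, 20_000, burn_in=0)
#                    for x0 in (-10.0, 0.0, 10.0)])
# print(gelman_rubin(chains))
```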
