Solved – How to use draws from a Markov chain (Monte Carlo) for (Bayesian) inference

bayesian, markov-chain-montecarlo

Say $X_0, X_1, \dots$ is a Markov chain with state space $S=\{0,1,2,\dots,N-1\}$ and transition probabilities $P(X_{n+1}=a+1 \bmod N \mid X_n=a) = P(X_{n+1}=a \mid X_n=a) = P(X_{n+1}=a-1 \bmod N \mid X_n=a) = 1/3$. (That is, the Markov chain is a random walk on $S$ with probability $1/3$ of staying put and probability $1/3$ each of moving up or down by one, with wraparound at the ends.) The limiting distribution is $\pi=(1/N,1/N,\dots,1/N)$, the uniform distribution on $S$.
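To make this concrete, here is a minimal Python sketch of the chain (the helper name `simulate_chain` and the `p_stay` parameter are just for illustration):

```python
import numpy as np

def simulate_chain(N, n_steps, x0=0, p_stay=1/3, rng=None):
    """Random walk on S = {0, ..., N-1} with wraparound: stay put with
    probability p_stay, otherwise move +1 or -1 (mod N) with equal
    probability. The default p_stay = 1/3 gives the 1/3-1/3-1/3 rule above."""
    rng = np.random.default_rng() if rng is None else rng
    p_move = (1 - p_stay) / 2
    steps = rng.choice([-1, 0, 1], size=n_steps, p=[p_move, p_stay, p_move])
    x = np.empty(n_steps + 1, dtype=int)
    x[0] = x0
    for n in range(n_steps):
        x[n + 1] = (x[n] + steps[n]) % N
    return x
```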

As I understand it, the point of MCMC is to construct a Markov chain whose stationary distribution is the distribution you want to sample from. So the chain above could be used to sample from the uniform distribution on $S$ (a silly example, but bear with me). From my understanding, you start at some $x_0$ in $S$, 'burn in' for some number of iterations, and treat subsequent iterations as draws from $\pi$.
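Continuing the sketch above, the recipe would look like this (the burn-in length here is an arbitrary choice, just for illustration):

```python
# Burn in, then treat subsequent iterations as (dependent) draws from pi.
N, burn_in, n_draws = 10, 1_000, 100_000
chain = simulate_chain(N, burn_in + n_draws)
draws = chain[burn_in + 1:]                     # discard the burn-in prefix
freq = np.bincount(draws, minlength=N) / len(draws)
print(freq)                                     # each entry should be near 1/N = 0.1
```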

Several questions:

First, how do you determine the length of the burn-in period? Even for the trivial example above, the time to reach the stationary distribution seems highly dependent on how "mobile" the chain is (for example, if $P(X_{n+1}=a \mid X_n=a) = 0.9$, it would take much longer for the distribution to spread out).
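To illustrate the "mobility" point, here is a quick sketch that powers the transition matrix and measures how far the exact $n$-step distribution is from uniform for the two stay-put probabilities:

```python
def tv_to_uniform(p_stay, N=10, n_steps=100, x0=0):
    """Total variation distance between the n-step distribution
    (started at x0) and the uniform limit, after n_steps steps."""
    p_move = (1 - p_stay) / 2
    P = np.zeros((N, N))
    for a in range(N):
        P[a, a] = p_stay
        P[a, (a + 1) % N] = p_move
        P[a, (a - 1) % N] = p_move
    dist = np.zeros(N)
    dist[x0] = 1.0
    for _ in range(n_steps):
        dist = dist @ P
    return 0.5 * np.abs(dist - 1 / N).sum()

# The sticky chain (p_stay = 0.9) is much further from uniform after 100 steps.
print(tv_to_uniform(1/3), tv_to_uniform(0.9))
```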

Second, after burn-in, what role does the dependence between draws play? The literature leads me to believe that I can treat draws after the burn-in period as draws from $\pi$, but clearly successive draws are still dependent on each other through the transition rule; with a 'stay-put' probability of $0.9$ as above, a small set of consecutive draws looks nothing like a uniform sample.
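Here is a crude way to see that dependence (treating the state labels as numbers, which is rough given the wraparound, but illustrative):

```python
def autocorr(x, lag=1):
    """Sample autocorrelation of a 1-D series at the given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# Lag-1 autocorrelation of the mobile vs. sticky chains after burn-in.
mobile = simulate_chain(10, 101_000, p_stay=1/3)[1_001:]
sticky = simulate_chain(10, 101_000, p_stay=0.9)[1_001:]
print(autocorr(mobile), autocorr(sticky))   # the sticky chain is far more correlated
```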

This seems to mean that any calculation I do with the draws (like approximating an expectation) must use a set of draws well beyond the burn-in period, but by how much?

Best Answer

Bayesian statistics does not depend on Markov chains (in theory, at least); Markov chain Monte Carlo is a method for making the computations in Bayesian statistics easier, or even doable at all. Generally we want to generate data from the posterior distribution, parts of which we can compute easily but not always all of (the normalizing constant is usually the hard part). MCMC methods will generate data from the posterior without needing the parts that are harder to find; that is the benefit of MCMC in Bayesian statistics (though there are some problems where the posterior is known exactly and MCMC is not needed at all). It is very rare that the posterior is uniform, so your example is not realistic, but it can still help with understanding.
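As a minimal sketch of why the normalizing constant is not needed (a generic illustration, not tied to any particular problem): in random-walk Metropolis the acceptance ratio uses only differences of the unnormalized log posterior, so any constant cancels.

```python
import numpy as np

def metropolis(log_unnorm_post, x0, n_iter, step=1.0, rng=None):
    """Random-walk Metropolis. Only the unnormalized log posterior is
    needed: the normalizing constant cancels in the acceptance ratio."""
    rng = np.random.default_rng() if rng is None else rng
    x, out = x0, np.empty(n_iter)
    for i in range(n_iter):
        prop = x + step * rng.normal()
        # Accept with probability min(1, p(prop) / p(x)).
        if np.log(rng.random()) < log_unnorm_post(prop) - log_unnorm_post(x):
            x = prop
        out[i] = x
    return out

# Example: sample N(0, 1) from its unnormalized density exp(-x^2 / 2).
samples = metropolis(lambda x: -0.5 * x**2, x0=0.0, n_iter=50_000)
```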

The short answer to the length of burn-in and how long to run the chain is "we don't know". The slightly longer (and more useful) answer is to run the chain for a while, then look at it to see if it looks good (yes, that is subjective, but experience helps, and if we knew the exact answer we would know enough not to need MCMC), then often run longer to be sure. I have seen cases where the chain was run for quite a few iterations and looked good, but the researchers decided to run it four times longer to be sure, and near the end it suddenly jumped to a part of the distribution that had not been covered before.
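One common way to make that "look at it" step less subjective is to run several chains from dispersed starting points and compare within-chain to between-chain variability, e.g. with a Gelman-Rubin style statistic. A rough sketch, reusing the `metropolis` sketch above:

```python
def gelman_rubin(chains):
    """Crude Gelman-Rubin R-hat for m chains of equal length n.
    Values near 1 suggest the chains agree; values well above 1
    suggest running longer."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

chains = [metropolis(lambda x: -0.5 * x**2, x0=s, n_iter=10_000)
          for s in (-10.0, 0.0, 10.0)]
print(gelman_rubin(chains))
```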

More dependence does mean that the chain needs to be run for longer (both burn-in and total iterations), but once the chain has run long enough the dependence no longer matters. Think of a simple random sample with all values independent; now sort the values into order statistics. They are no longer independent, but the mean and standard deviation are the same and estimate the population just as well. The fact that MCMC iterations are dependent does not matter as long as there are enough iterations to fully represent the distribution. Your uniform example shows the risk of running for too short a time.
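"Long enough" can be made roughly quantitative with an effective sample size: with lag-$k$ autocorrelations $\rho_k$, approximately $n_{\mathrm{eff}} = n / (1 + 2\sum_{k \ge 1} \rho_k)$, the number of independent draws the dependent run is worth. A crude sketch:

```python
def effective_sample_size(x, max_lag=200):
    """Crude ESS estimate: n / (1 + 2 * sum of positive autocorrelations),
    truncated at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n, denom = len(x), np.dot(x, x)
    acf_sum = 0.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / denom
        if rho <= 0:
            break
        acf_sum += rho
    return n / (1 + 2 * acf_sum)

print(effective_sample_size(samples))   # far fewer than 50_000 independent draws' worth
```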
