Solved – Acceptance ratio in Metropolis–Hastings algorithm

markov-chain-montecarlo, metropolis-hastings

In the Metropolis–Hastings algorithm for sampling a target distribution, let:

  • $\pi_{i}$ be the target density at state $i$,
  • $\pi_j$ be the target density at the proposed state $j$,
  • $h_{ij}$ be the proposal density for transition to state $j$ given current state $i$,
  • $a_{ij}$ be the acceptance probability of the proposed state $j$ given the current state $i$.

Then, by the detailed balance equation, once the proposal density $h$ has been chosen, the acceptance probability $a$ is computed as:
$$
a_{ij} = \min\left(1, \frac{\pi_{j} h_{ji}}{\pi_{i} h_{ij}}\right).
$$

If $h$ is symmetric, i.e., $h_{ij}=h_{ji}$, then:
$$
a_{ij} = \min\left(1, \frac{\pi_{j}}{\pi_{i}}\right).
$$
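
As a concrete (if trivial) illustration, here is a minimal Python sketch of these two formulas. The function names and the use of unnormalized densities are my own choices, not part of the question:

```python
def acceptance_prob(pi_i, pi_j, h_ij, h_ji):
    """General MH acceptance probability: min(1, pi_j * h_ji / (pi_i * h_ij)).
    The target densities only need to be known up to a common constant, since it cancels."""
    return min(1.0, (pi_j * h_ji) / (pi_i * h_ij))

def acceptance_prob_symmetric(pi_i, pi_j):
    """Symmetric proposal (h_ij = h_ji): the ratio reduces to pi_j / pi_i."""
    return min(1.0, pi_j / pi_i)
```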

When the proposal $h_i$ is a Gaussian density centered at state $i$ with the same variance $\sigma^2$ for all $i$, $h$ is symmetric. From Wikipedia:

If $\sigma^2$ is too large, almost all steps under the MH algorithm will be rejected. On the other hand, if $\sigma^2$ is too small, almost all steps will be accepted.
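
The quoted behaviour is easy to check numerically. Below is a small sketch (my own toy setup, not from the quote: a standard-normal target and a few illustrative values of $\sigma$) that estimates the empirical acceptance rate of a random-walk Metropolis sampler:

```python
import numpy as np

def mh_acceptance_rate(log_target, sigma, n_steps=20000, x0=0.0, seed=0):
    """Run a random-walk Metropolis chain and return the fraction of accepted proposals."""
    rng = np.random.default_rng(seed)
    x, accepted = x0, 0
    for _ in range(n_steps):
        y = x + sigma * rng.standard_normal()           # symmetric Gaussian proposal
        if np.log(rng.uniform()) < log_target(y) - log_target(x):
            x, accepted = y, accepted + 1               # accept and move
        # on rejection the chain simply stays at x
    return accepted / n_steps

log_target = lambda x: -0.5 * x**2                      # standard normal, up to a constant
for sigma in (0.1, 1.0, 10.0, 100.0):
    print(f"sigma = {sigma:6.1f}  acceptance rate = {mh_acceptance_rate(log_target, sigma):.3f}")
# Small sigma -> acceptance rate near 1; large sigma -> most proposals rejected,
# exactly the pattern described in the quote.
```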

Why does the acceptance probability change in the opposite direction to the change in the variance of the proposal density, as described in the quote above?

Best Answer

To build intuition, and to simplify matters, I always think first of a single parameter with a uniform (diffuse) prior, so that the MAP estimate of the parameter coincides with the MLE. However, assume that your likelihood function is complicated enough to have several local maxima.

What MCMC does in this 1-D example is explore the posterior curve until it finds the regions of highest probability. If the variance is too small, you will almost surely get stuck near a local maximum, because you will always be sampling values close to it: the MCMC algorithm will "think" it has settled on the target distribution. If the variance is too large, on the other hand, once the chain sits on a local maximum it will reject most proposals until it happens to hit another high-probability region. And if you do propose a value at the MAP (or a similar region whose probability is much larger than the others'), with a large variance you will end up rejecting almost every other value, because the gap between this region and the rest is too large.
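
To make this concrete, here is a rough sketch (my own toy example, not part of the original answer) with a bimodal target: an equal mixture of two well-separated Gaussians. With a very small proposal variance the chain tends to stay near the mode it starts in; with a very large one it sits on a mode and rejects most proposals:

```python
import numpy as np

def run_chain(sigma, n_steps=20000, x0=-5.0, seed=1):
    """Random-walk Metropolis on a toy bimodal target with modes near -5 and +5."""
    rng = np.random.default_rng(seed)
    # log of 0.5*N(-5,1) + 0.5*N(+5,1), up to an additive constant (logaddexp avoids underflow)
    log_target = lambda x: np.logaddexp(-0.5 * (x + 5) ** 2, -0.5 * (x - 5) ** 2)
    xs, x, accepted = [], x0, 0
    for _ in range(n_steps):
        y = x + sigma * rng.standard_normal()            # symmetric Gaussian proposal
        if np.log(rng.uniform()) < log_target(y) - log_target(x):
            x, accepted = y, accepted + 1
        xs.append(x)
    xs = np.asarray(xs)
    # acceptance rate, and how much time the chain spent on the right-hand mode
    return accepted / n_steps, np.mean(xs > 0)

for sigma in (0.2, 3.0, 50.0):
    rate, frac_right = run_chain(sigma)
    print(f"sigma = {sigma:5.1f}  acceptance rate = {rate:.2f}  time near +5 mode = {frac_right:.2f}")
```

The exact numbers vary from run to run, but the qualitative pattern is the one described above: sticking with a tiny variance, heavy rejection with a huge one, and reasonable mixing somewhere in between.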

Of course, all of the above affects the convergence rate, not the convergence per se of your chains. Recall that, whatever the variance, as long as the probability of proposing a value in the region of the global maximum is positive, your chain will converge.

To get around this problem, one thing you can do is try different variances for each parameter during a burn-in period, aiming at an acceptance rate that satisfies your needs (say $0.44$; see Gelman, Roberts & Gilks, 1995, and Gelman, Gilks & Roberts, 1997, to learn more about selecting a "good" acceptance rate, which, of course, will depend on the form of your posterior distribution). Of course, in this case the chain is no longer Markovian, so you must NOT use those burn-in draws for inference: you use them only to adjust the variance.
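
As a sketch of what such burn-in tuning might look like (my own crude widen/narrow rule on a toy 1-D target, not a quote of the Gelman et al. recipes), one could do something like:

```python
import numpy as np

def tune_sigma(log_target, sigma=1.0, target_rate=0.44,
               n_rounds=50, batch=200, x0=0.0, seed=2):
    """Adapt the proposal standard deviation during burn-in to approach a target acceptance rate."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(n_rounds):
        accepted = 0
        for _ in range(batch):
            y = x + sigma * rng.standard_normal()        # symmetric Gaussian proposal
            if np.log(rng.uniform()) < log_target(y) - log_target(x):
                x, accepted = y, accepted + 1
        rate = accepted / batch
        # crude adaptation: widen the proposal if accepting too often, narrow it otherwise
        sigma *= 1.1 if rate > target_rate else 0.9
    # the burn-in draws themselves are thrown away; only sigma (and a starting value) are kept
    return sigma, x

log_target = lambda x: -0.5 * x**2                       # toy target: standard normal
sigma, x_start = tune_sigma(log_target)
print("tuned sigma:", sigma)
# The actual (Markovian) chain used for inference is then run from x_start with this *fixed* sigma.
```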