How is it possible to get Metropolis-Hastings acceptance rates close to 1 (for example, when exploring a unimodal distribution with a normal proposal distribution whose SD is too small), even after burn-in is over? I see it in my own MCMC chains but I don't understand how it makes sense. It seems to me that after reaching the summit, the acceptance rate should stabilize around values smaller than 0.5.
Markov Chain Monte Carlo – Acceptance Rate for Metropolis-Hastings > 0.5
markov-chain-montecarlo, metropolis-hastings
Related Solutions
I believe that Weak convergence and optimal scaling of random walk Metropolis algorithms by Roberts, Gelman and Gilks is the source for the 0.234 optimal acceptance rate.
What the paper shows is that, under certain assumptions, you can scale the random walk Metropolis-Hastings algorithm as the dimension of the space goes to infinity to obtain a limiting diffusion for each coordinate. In the limit, the diffusion is "most efficient" when the acceptance rate takes the value 0.234. Intuitively, this is a tradeoff between taking too many small accepted steps and making too many large proposals that get rejected.
The Metropolis-Hastings algorithm is not really an optimization algorithm, in contrast to simulated annealing. It is an algorithm that is supposed to simulate from the target distribution, hence the acceptance probability should not be driven towards 0.
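This tradeoff is easy to see empirically. Below is a minimal sketch (in Python rather than the R used later in this post, but the idea is language-agnostic) of random-walk Metropolis on a standard normal target, where we track both the acceptance rate and the expected squared jump distance (ESJD), a simple proxy for how fast the chain actually moves. The function name `esjd_and_acceptance` is just for this illustration:

```python
import random, math

def esjd_and_acceptance(step_sd, n=50000, seed=0):
    # Random-walk Metropolis on a standard normal target.
    # Returns (acceptance rate, expected squared jump distance).
    # ESJD balances step size against rejection: tiny steps are almost
    # always accepted but barely move; huge steps move far but are
    # almost always rejected.
    rng = random.Random(seed)
    x, accepted, sq_jump = 0.0, 0, 0.0
    for _ in range(n):
        prop = x + rng.gauss(0.0, step_sd)
        # Symmetric proposal, so the Hastings ratio is just the
        # target density ratio: log N(prop; 0, 1) - log N(x; 0, 1)
        log_alpha = 0.5 * (x * x - prop * prop)
        if math.log(rng.random()) < log_alpha:
            sq_jump += (prop - x) ** 2
            x, accepted = prop, accepted + 1
    return accepted / n, sq_jump / n

for sd in (0.1, 1.0, 2.4, 10.0):
    acc, esjd = esjd_and_acceptance(sd)
    print(f"sd={sd:5.1f}  acceptance={acc:.2f}  ESJD={esjd:.3f}")
```

With a tiny step SD the acceptance rate is near 1 but ESJD is tiny; with a huge step SD acceptance collapses; ESJD is best in between, which is exactly the regime the optimal-scaling results formalize.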
Here you go - three examples. I've made the code much less efficient than it would be in a real application in order to make the logic clearer (I hope.)
# We'll assume estimation of a Poisson mean as a function of x
x <- runif(100)
y <- rpois(100,5*x) # beta = 5 where mean(y[i]) = beta*x[i]
# Prior distribution on log(beta): t(5) with mean 2
# (Very spread out on original scale; median = 7.4, roughly)
log_prior <- function(log_beta) dt(log_beta-2, 5, log=TRUE)
# Log likelihood
log_lik <- function(log_beta, y, x) sum(dpois(y, exp(log_beta)*x, log=TRUE))
# Random Walk Metropolis-Hastings
# Proposal is centered at the current value of the parameter
rw_proposal <- function(current) rnorm(1, current, 0.25)
rw_p_proposal_given_current <- function(proposal, current) dnorm(proposal, current, 0.25, log=TRUE)
rw_p_current_given_proposal <- function(current, proposal) dnorm(current, proposal, 0.25, log=TRUE)
rw_alpha <- function(proposal, current) {
  # Due to the structure of the rw proposal distribution, the rw_p_proposal_given_current and
  # rw_p_current_given_proposal terms cancel out, so we don't need to include them - although
  # logically they are still there: p(prop|curr) = p(curr|prop) for all curr, prop
  exp(log_lik(proposal, y, x) + log_prior(proposal) - log_lik(current, y, x) - log_prior(current))
}
# Independent Metropolis-Hastings
# Note: the proposal is independent of the current value (hence the name), but I maintain the
# parameterization of the functions anyway. The proposal is no longer ignorable
# when calculating the acceptance probability, as p(curr|prop) != p(prop|curr) in general.
ind_proposal <- function(current) rnorm(1, 2, 1)
ind_p_proposal_given_current <- function(proposal, current) dnorm(proposal, 2, 1, log=TRUE)
ind_p_current_given_proposal <- function(current, proposal) dnorm(current, 2, 1, log=TRUE)
ind_alpha <- function(proposal, current) {
  exp(log_lik(proposal, y, x) + log_prior(proposal) + ind_p_current_given_proposal(current, proposal)
      - log_lik(current, y, x) - log_prior(current) - ind_p_proposal_given_current(proposal, current))
}
# Vanilla Metropolis-Hastings - the independence sampler would do here, but I'll add something
# else for the proposal distribution: a Normal(current, 0.1+abs(current)/5). Its density is symmetric
# about the current value, but its scale depends on the location, so we can't ignore the proposal
# distribution when calculating alpha, as p(prop|curr) != p(curr|prop) in general
van_proposal <- function(current) rnorm(1, current, 0.1+abs(current)/5)
van_p_proposal_given_current <- function(proposal, current) dnorm(proposal, current, 0.1+abs(current)/5, log=TRUE)
van_p_current_given_proposal <- function(current, proposal) dnorm(current, proposal, 0.1+abs(proposal)/5, log=TRUE)
van_alpha <- function(proposal, current) {
  exp(log_lik(proposal, y, x) + log_prior(proposal) + van_p_current_given_proposal(current, proposal)
      - log_lik(current, y, x) - log_prior(current) - van_p_proposal_given_current(proposal, current))
}
# Generate the chain
values <- rep(0, 10000)
u <- runif(length(values))
naccept <- 0
current <- 1 # Initial value
propfunc <- van_proposal # Substitute ind_proposal or rw_proposal here
alphafunc <- van_alpha # Substitute ind_alpha or rw_alpha here
for (i in 1:length(values)) {
  proposal <- propfunc(current)
  alpha <- alphafunc(proposal, current)
  if (u[i] < alpha) {
    values[i] <- exp(proposal)
    current <- proposal
    naccept <- naccept + 1
  } else {
    values[i] <- exp(current)
  }
}
naccept / length(values)
summary(values)
For the vanilla sampler, we get:
> naccept / length(values)
[1] 0.1737
> summary(values)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.843 5.153 5.388 5.378 5.594 6.628
which is a fairly low acceptance probability; tuning the proposal, or adopting a different one, would help here. Here are the random walk proposal results:
> naccept / length(values)
[1] 0.2902
> summary(values)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.718 5.147 5.369 5.370 5.584 6.781
Similar results, as one would hope, and a better acceptance probability (with a single parameter, the usual target is roughly 0.44, often rounded to ~50%.)
And, for completeness, the independence sampler:
> naccept / length(values)
[1] 0.0684
> summary(values)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.990 5.162 5.391 5.380 5.577 8.802
Because it doesn't "adapt" to the shape of the posterior, it tends to have the poorest acceptance probability and is hardest to tune well for this problem.
Note that generally speaking we'd prefer proposals with fatter tails, but that's a whole other topic.
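To sketch why fatter tails matter for the independence sampler: when the proposal's tails are lighter than the target's, the importance ratio target/proposal is unbounded and the chain can stick at extreme points. A heavier-tailed proposal keeps that ratio bounded. Here's a small illustration in Python (the post's examples are in R, but the mechanics are identical); `independence_mh` and `draw_t3` are hypothetical names for this sketch, and the target is a standard normal:

```python
import random, math

def independence_mh(draw, log_q, n=40000, seed=3):
    # Independence Metropolis-Hastings targeting a standard normal.
    # log_q may be unnormalized: its constant cancels in the Hastings ratio.
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n):
        prop = draw(rng)
        # alpha = [pi(prop) q(x)] / [pi(x) q(prop)], computed on the log scale
        log_alpha = (-0.5 * prop ** 2 - log_q(prop)) - (-0.5 * x ** 2 - log_q(x))
        if math.log(rng.random()) < log_alpha:
            x, accepted = prop, accepted + 1
    return accepted / n

# Light-tailed proposal, narrower than the target: the weight pi/q blows up
# in the tails, so the chain sticks whenever it reaches an extreme point.
narrow = independence_mh(lambda rng: rng.gauss(0.0, 0.4),
                         lambda x: -0.5 * (x / 0.4) ** 2)

# Heavier-tailed t(3) proposal: the weights stay bounded and the chain mixes well.
def draw_t3(rng):
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return z / math.sqrt(chi2 / 3.0)

t3 = independence_mh(draw_t3, lambda x: -2.0 * math.log(1.0 + x * x / 3.0))
print(f"narrow normal proposal: {narrow:.2f}, t(3) proposal: {t3:.2f}")
```

The t(3) proposal should give a noticeably higher acceptance rate than the too-narrow normal, mirroring the sticking visible in the independence sampler's `Max.` value above.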
Best Answer
The acceptance rate depends largely on the proposal distribution. If it has small variance, the ratio of target densities at the current point and the proposal will necessarily be close to 1, giving a high acceptance probability. This is because the target densities we typically work with are locally Lipschitz (a form of smoothness) at small scales, so the density at two nearby points is similar (informally speaking).
If your current sample is close to the MAP value, the proposals will have an acceptance probability strictly less than 1, but it can still be very close to 1.
As a side note, standard practice is to tune the proposal distribution to get around a 0.2-0.25 acceptance rate. See here for a discussion of this.
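You can check this directly with a quick sketch (Python here rather than R; `acceptance_near_mode` is a name invented for this illustration). Start the chain exactly at the mode of a standard normal — the "summit" from the question — and compare acceptance rates for a tiny versus a large proposal SD:

```python
import random, math

def acceptance_near_mode(step_sd, n=20000, seed=42):
    # Random-walk MH on a standard normal, started at the mode (x = 0).
    # With a symmetric proposal, alpha is just the target density ratio.
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n):
        prop = x + rng.gauss(0.0, step_sd)
        if math.log(rng.random()) < 0.5 * (x * x - prop * prop):
            x, accepted = prop, accepted + 1
    return accepted / n

print(acceptance_near_mode(0.01))  # tiny steps: density ratio ~ 1, nearly everything accepted
print(acceptance_near_mode(5.0))   # big steps overshoot the bulk and are often rejected
```

The tiny-SD chain accepts almost every proposal even though it sits at the summit: nothing about being at the mode forces rejections, only a proposal scale large enough to reach regions of much lower density does.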