Markov Chain – Dynamically Adjusting Parameters of Markov Chain

Tags: autocorrelation, markov-chain-montecarlo, markov-process, metropolis-hastings

I am using a Metropolis algorithm to generate samples from a complicated (high-dimensional) probability distribution. As is common, the proposed updates depend on some "step size" parameter $\epsilon$, such that

  • too small an $\epsilon$ leads to very small updates and thus long autocorrelation times,
  • too large an $\epsilon$ leads to low acceptance rates and thus long autocorrelation times.

Common wisdom is that one should choose $\epsilon$ such that the overall acceptance probability is somewhere in the 50%-90% range in order to optimize for autocorrelation times. Usually this tuning is done by hand, i.e. using a couple of trial runs.
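For concreteness, a minimal random-walk Metropolis sampler with a fixed step size $\epsilon$ might look like the sketch below (the standard-normal target in the usage note is just a stand-in for the complicated distribution; function and variable names are my own):

```python
import numpy as np

def rw_metropolis(log_density, x0, eps, n_steps, rng):
    """Random-walk Metropolis with isotropic Gaussian proposals of scale eps.

    Returns the chain and the empirical acceptance rate."""
    x = np.asarray(x0, dtype=float)
    logp = log_density(x)
    samples = np.empty((n_steps, x.size))
    accepted = 0
    for i in range(n_steps):
        proposal = x + eps * rng.standard_normal(x.size)
        logp_prop = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)), done in log space.
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = proposal, logp_prop
            accepted += 1
        samples[i] = x
    return samples, accepted / n_steps
```

Hand-tuning then amounts to running this a few times with different `eps` values and inspecting the returned acceptance rate, e.g. `rw_metropolis(lambda x: -0.5 * np.sum(x**2), np.zeros(2), 1.0, 5000, np.random.default_rng(0))`.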

Now the question: Is it valid to adjust $\epsilon$ dynamically during the Markov process? By "valid" I mean: is the stationary distribution of the process still the same?

For example, I could track the acceptance rate so far, and:

  • If the rate is $<0.9$, decrease $\epsilon$ a little
  • If the rate is $>0.9$, increase $\epsilon$ a little

I could do this automatic adjustment after every step. I don't see why this would be wrong (after all, the stationary distribution does not depend on $\epsilon$). But it seems very fishy to me to make the simulation parameters depend on the simulation results.
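The per-step scheme described above can be sketched as follows (the tuning constants `target` and `factor` are illustrative, not prescribed anywhere):

```python
import numpy as np

def naive_adaptive_metropolis(log_density, x0, eps0, n_steps,
                              target=0.9, factor=1.01, seed=0):
    """Random-walk Metropolis that rescales eps after EVERY step based on
    the running acceptance rate of the whole chain so far. Note: because
    each transition depends on the entire past through eps, the process
    is no longer a Markov chain."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    logp = log_density(x)
    eps = eps0
    accepted = 0
    samples = np.empty((n_steps, x.size))
    for i in range(n_steps):
        proposal = x + eps * rng.standard_normal(x.size)
        logp_prop = log_density(proposal)
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = proposal, logp_prop
            accepted += 1
        samples[i] = x
        # Adapt: shrink eps if accepting too rarely, grow it if too often.
        rate = accepted / (i + 1)
        eps = eps / factor if rate < target else eps * factor
    return samples, eps
```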

Could someone help me to wrap my head around this?

Best Answer

First, the optimal acceptance rate for random walk Metropolis-Hastings is lower than 50% (for MALA it is higher, about 57%). For instance, to quote from an answer to an earlier X validated question,

I believe that Weak convergence and optimal scaling of random walk Metropolis algorithms by Roberts, Gelman and Gilks is the source for the 0.234 optimal acceptance rate.

What the paper shows is that, under certain assumptions, you can scale the random walk Metropolis-Hastings algorithm as the dimension of the space goes to infinity to get a limiting diffusion for each coordinate. In the limit, the diffusion can be seen as "most efficient" if the acceptance rate takes the value 0.234. Intuitively, it is a tradeoff between making too many small accepted steps and making too many large proposals that get rejected.

Second, continuing to tune $\epsilon$ based on the past of the chain invalidates the Markov justification of MCMC algorithms, unless the tuning is controlled, for instance by imposing diminishing adaptation. Entries on X Validated with the keywords adaptive and metropolis provide some references on the topic.
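As one sketch of a controlled scheme: adapt $\log\epsilon$ by a Robbins-Monro step whose magnitude decays like $n^{-\gamma}$ with $\gamma \in (1/2, 1]$, so the amount of adaptation per step vanishes. The target rate 0.234 and decay exponent below are illustrative choices, not the only valid ones:

```python
import numpy as np

def diminishing_adaptation_metropolis(log_density, x0, eps0, n_steps,
                                      target=0.234, gamma=0.6, seed=0):
    """Random-walk Metropolis where log(eps) is nudged toward a target
    acceptance probability with a step size decaying like n**-gamma.
    The decaying step size is what makes the adaptation 'diminishing'."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    logp = log_density(x)
    log_eps = np.log(eps0)
    samples = np.empty((n_steps, x.size))
    for n in range(1, n_steps + 1):
        proposal = x + np.exp(log_eps) * rng.standard_normal(x.size)
        logp_prop = log_density(proposal)
        # Acceptance probability, computed in log space to avoid overflow.
        alpha = np.exp(min(0.0, logp_prop - logp))
        if rng.uniform() < alpha:
            x, logp = proposal, logp_prop
        # Robbins-Monro update: push eps up when accepting too often,
        # down when accepting too rarely; the n**-gamma factor decays.
        log_eps += n ** -gamma * (alpha - target)
        samples[n - 1] = x
    return samples, np.exp(log_eps)
```

In contrast to adapting by a fixed factor forever, here the total future adaptation shrinks to zero, which is the kind of condition under which adaptive MCMC can still be justified.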
