Solved – MCMC: how to choose the standard deviation of the sampling density

markov-chain-montecarlo, metropolis-hastings, posterior, sampling

I'm using Metropolis-Hastings MCMC to find the set of parameters $\theta$ of a model $M$ that best fits my experimental data $x$ (with noise), where it is possible to directly calculate $x$ (without noise) from $\theta$ but not to directly calculate $\theta$ from $x$. By sampling the (posterior) probability distribution of each $\theta$ being the best fit, which is $P(\theta|x)$, MCMC will also let me look at things like parameter correlations and sensitivity of each parameter to the fit. First of all, I'd like to confirm whether my basic knowledge about MCMC is correct.

The way the Metropolis-Hastings algorithm works is that if I know a function $f(\theta|x)$ that is proportional to the posterior, I can sample the posterior by starting at $\theta_0$ and moving according to a proposal density $g(\theta_{i+1}|\theta_i)$ that suggests a candidate set of parameters given the current set. Each time a candidate is generated, I calculate the acceptance ratio $\alpha=f(\theta_{i+1}|x)/f(\theta_i|x)$ and accept the move with probability $\min(1,\alpha)$, which determines whether to stay at $\theta_i$ or go to $\theta_{i+1}$. (I believe $f$ is usually called the target density, and $g$ the proposal or sampling density.)
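For concreteness, here is a minimal sketch of that random-walk scheme in Python, assuming a symmetric Gaussian proposal as described above; the function and variable names (`f`, `theta0`, `s`) are illustrative, not from any particular library.

```python
import numpy as np

def metropolis(f, theta0, s, n_steps, rng=None):
    """Random-walk Metropolis sampler.

    f       : function proportional to the posterior, f(theta)
    theta0  : starting parameter vector
    s       : standard deviation of the Gaussian proposal g
    n_steps : number of proposals to generate
    """
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    f_curr = f(theta)
    chain = [theta.copy()]
    for _ in range(n_steps):
        # Propose a candidate from g(. | theta): a Gaussian centered at theta
        candidate = theta + rng.normal(0.0, s, size=theta.shape)
        f_cand = f(candidate)
        # Accept with probability min(1, alpha); because the Gaussian
        # proposal is symmetric, the Hastings correction cancels
        alpha = f_cand / f_curr
        if rng.uniform() < alpha:
            theta, f_curr = candidate, f_cand
        chain.append(theta.copy())
    return np.array(chain)
```

In practice one usually compares log-densities rather than the ratio $f_\text{cand}/f_\text{curr}$ directly, to avoid floating-point underflow when $f$ is very small.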

For the proposal density $g$, I'm using a Gaussian centered at $\theta_i$ with a fixed standard deviation $s$ that I specify, which seems common from what I've read. The question is what to use for $f$. Since the goal is to minimize the error between $x$ and the calculated data $x_M(\theta)$, I have a least-squares fitness function, $F(\theta)=\frac{1}{N}\sum_{n=1}^{N}\frac{\left(x_n-x_{M,n}(\theta)\right)^2}{x_n^2}$, where $N$ is the number of points in $x$ and $x_M$. The reason for the $x_n^2$ in the denominator is to normalize the magnitude of $x$, which can be scaled differently depending on how one processes the data. Then I assume that the posterior is Gaussian, so for $f$ I use a Gaussian centered at $F=0$ with standard deviation $\sigma$, where $\sigma$ comes from the model. Therefore, the acceptance ratio is $\alpha=\exp\left(-\frac{1}{2\sigma^2}\left(F(\theta_{i+1})^2-F(\theta_i)^2\right)\right)$.
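Written out, the target density and acceptance ratio described above look like the following sketch; `x_model` is a hypothetical stand-in for the forward calculation $x_M(\theta)$, since that computation is problem-specific.

```python
import numpy as np

def fitness(theta, x, x_model):
    """F(theta): mean squared relative residual between data and model."""
    return np.mean((x - x_model(theta)) ** 2 / x ** 2)

def log_f(theta, x, x_model, sigma):
    """log f(theta|x) for a Gaussian in F centered at F = 0 with std sigma."""
    F = fitness(theta, x, x_model)
    return -F ** 2 / (2.0 * sigma ** 2)

# The acceptance ratio then follows by exponentiating a difference of logs:
#   alpha = exp(log_f(theta_new, ...) - log_f(theta_old, ...))
#         = exp(-(F(theta_new)**2 - F(theta_old)**2) / (2 * sigma**2))
```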

My question is, what is the meaning of $\sigma^2$ in this case? Does it describe the variance of the noise in the data (since $F$ depends partly on the variance in the data) or the variance in how close the parameters are to the best-fit parameters (since $F$ also depends on this)? (Based on my least-squares equation, for $F=0$ there would have to be no noise in the data, and the fitted parameters would all have to be exactly right.) I'm thinking it's the latter, because it's basically a rewording of the posterior distribution, but in that case I can't wrap my head around how to model $\sigma^2$. I've been reading this page, which towards the bottom talks about fitting $\sigma$ as a parameter, and that confuses me even more.
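For what it's worth, one common reading of the "fit $\sigma$ as a parameter" idea (an assumption here, not something stated in the question) is that $\sigma$ is the relative noise level of the data. Under a Gaussian noise model $x_n \sim \mathcal{N}(x_{M,n}(\theta),\, \sigma x_n)$, the log-likelihood is $-N\log\sigma - N F(\theta)/(2\sigma^2)$, which differs from the $\exp(-F^2/2\sigma^2)$ form above, and $\sigma$ can simply be appended to the parameter vector and sampled along with $\theta$. A minimal sketch:

```python
import numpy as np

def log_posterior(params, x, x_model):
    """Joint log-posterior over (theta, log sigma).

    Assumes Gaussian noise with relative standard deviation sigma, i.e.
    x_n ~ Normal(x_M,n(theta), sigma * x_n), and a flat prior on log(sigma)
    (a Jeffreys-type 1/sigma prior on sigma). Illustrative only; the
    question's own f uses exp(-F(theta)**2 / (2 * sigma**2)) instead.
    """
    theta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    F = np.mean((x - x_model(theta)) ** 2 / x ** 2)  # F(theta) as above
    n = x.size
    return -n * log_sigma - n * F / (2.0 * sigma ** 2)
```

Sampling $\log\sigma$ rather than $\sigma$ keeps the chain away from the $\sigma \le 0$ boundary; the marginal posterior of $\sigma$ then describes the noise level, while the marginal over $\theta$ gives the parameter uncertainties.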

Best Answer

There are several papers that propose methods to adapt the Metropolis sampler in order to achieve a certain optimality criterion. The most popular approach consists of controlling the acceptance rate. Under certain regularity conditions on the target distribution, it has been shown that the Metropolis algorithm performs well if the acceptance rate is $\approx 0.44$ in dimension 1 and $\approx 0.234$ in higher dimensions:

G. O. Roberts, A. Gelman, and W. R. Gilks (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1), 110–120.
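As an illustration of the acceptance-rate idea, here is a minimal sketch that tunes the proposal standard deviation $s$ during burn-in so the observed acceptance rate approaches a target (0.44 in one dimension, 0.234 in higher dimensions). This is a generic Robbins-Monro-style recipe for illustration, not the exact scheme from the paper:

```python
import numpy as np

def tune_proposal_scale(log_f, theta0, s0, target=0.234,
                        batch=50, n_batches=100, rng=None):
    """Adapt the proposal std s toward a target acceptance rate during burn-in.

    After each batch of proposals, s is scaled up if the observed acceptance
    rate exceeds the target and down otherwise, with a decaying step size.
    """
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    s, lf = s0, log_f(theta)
    for b in range(n_batches):
        accepted = 0
        for _ in range(batch):
            candidate = theta + rng.normal(0.0, s, size=theta.shape)
            lf_cand = log_f(candidate)
            if np.log(rng.uniform()) < lf_cand - lf:  # Metropolis accept step
                theta, lf = candidate, lf_cand
                accepted += 1
        rate = accepted / batch
        # Multiplicative update; the sqrt(b + 1) decay keeps adaptation stable
        s *= np.exp((rate - target) / np.sqrt(b + 1))
    return s, theta
```

Adaptation like this is usually confined to burn-in (or made diminishing over time), since a proposal that keeps changing with the chain's history breaks the Markov property that guarantees convergence to the target distribution.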

On the other hand, self-adaptive algorithms that make use of parallel chains have been developed in:

J. Andrés Christen and Colin Fox (2010). A general purpose sampling algorithm for continuous distributions (the t-walk). Bayesian Analysis, 5(2), 263–281.

It is also implemented in the R package Rtwalk.

There is no general criterion that makes the Metropolis algorithm efficient in every single scenario, but these two algorithms work well in non-pathological cases and in low to moderate dimensions ($<20$).
