Approximate Bayesian Computation – Making the Tolerance Threshold $\epsilon$ Adaptive

approximate-bayesian-computation · bayesian · descriptive-statistics · markov-chain-montecarlo · particle-filter

Briefly, Approximate Bayesian Computation (ABC), instead of using the exact likelihood function $L(\theta;x)$, approximates the posterior through the observed summary statistics $s(x_{obs})$ of the data $x$.

The posterior distribution for the parameter $\theta$ can be defined in the following way

$$p_{\epsilon}(\theta|s(x_{obs}))\propto \pi(\theta) K_{\epsilon}(\rho(s(x_{obs}),s(x_{sim})))$$

$\bullet$ $s(x_{obs})$ are the observed summary statistics,

$\bullet$ $s(x_{sim})$ are the simulated summary statistics, i.e. computed from a sample $x_{sim}\sim L(\theta;x)$,

$\bullet$ $\rho$ is a distance function between the summary statistics $s(x_{obs})$ and $s(x_{sim})$, it can be for example the Euclidean distance,

$\bullet$ $K_{\epsilon}(\cdot)$ is a probability kernel, usually a normal distribution with mean equal to $0$ and variance equal to $\epsilon$. So, the closer the distance $\rho(s(x_{obs}),s(x_{sim}))$ is to $0$, the more weight the kernel $K_{\epsilon}(\cdot)$ gives to $s(x_{sim})$ (or, equivalently, to the parameter value $\theta$ that was used to simulate $s(x_{sim})$),

$\bullet$ $\epsilon$ is the tolerance threshold, i.e. it tunes how far we want our simulated data to be from the observed ones.
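To make the roles of $\rho$ and $K_{\epsilon}$ concrete, here is a minimal sketch (a Euclidean distance and a Gaussian kernel with variance $\epsilon$, matching the definitions above; the function names are mine):

```python
import numpy as np

def rho(s_obs, s_sim):
    """Euclidean distance between observed and simulated summary statistics."""
    return np.linalg.norm(np.asarray(s_obs) - np.asarray(s_sim))

def gaussian_kernel(d, eps):
    """Gaussian kernel K_eps with mean 0 and variance eps, evaluated at distance d.

    Distances near 0 get weight near the maximum; distances large relative
    to sqrt(eps) get weight near 0.
    """
    return np.exp(-0.5 * d**2 / eps) / np.sqrt(2 * np.pi * eps)

# A simulated summary close to the observed one gets more weight:
w_close = gaussian_kernel(rho([1.0, 2.0], [1.1, 2.1]), eps=1.0)
w_far = gaussian_kernel(rho([1.0, 2.0], [4.0, 5.0]), eps=1.0)
```

Note that only the ratio of kernel values matters in the MCMC acceptance step, so the normalizing constant can be dropped in practice.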


I sketch the Markov Chain Monte Carlo (MCMC) algorithm that I am using to sample from the posterior $p_{\epsilon}(\theta|s(x_{obs}))$:

$1) \ \ \text{Initialize} \ \theta^{0} \ \text{and simulate} \ s(x^{0}_{sim})\sim L(\theta^{0};x)$

$\text{For} \ i=1,\dots,N \ \text{MCMC iterations do}$

$2) \ \ \theta^{i} \sim p(\theta^{i}|\theta^{i-1})$

$3) \ \ s(x^{i}_{sim})\sim L(\theta^{i};x)$

$4) \ \ A = \frac{\pi(\theta^{i})\, K_{\epsilon}(\rho(s(x_{obs}),s(x^{i}_{sim})))\, p(\theta^{i-1}|\theta^{i})}{\pi(\theta^{i-1})\, K_{\epsilon}(\rho(s(x_{obs}),s(x^{i-1}_{sim})))\, p(\theta^{i}|\theta^{i-1})}$

$5) \ \ \text{If} \ Unif(0,1) \leq A \ \text{accept} \ \theta^{i}; \ \text{otherwise set} \ \theta^{i}=\theta^{i-1} \ \text{and} \ s(x^{i}_{sim})=s(x^{i-1}_{sim})$
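The five steps above can be sketched as follows. Everything problem-specific here is a stand-in: a toy Gaussian simulator, a flat prior, and a symmetric random-walk proposal (the latter two make the $\pi$ and $p$ terms cancel in the acceptance ratio):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta):
    """Toy stand-in for the expensive simulator L(theta; x):
    returns the summary statistic (sample mean) of simulated data."""
    return rng.normal(theta, 1.0, size=100).mean()

def log_kernel(d, eps):
    # Log of a Gaussian kernel with variance eps; constants cancel in the ratio.
    return -0.5 * d**2 / eps

def abc_mcmc(s_obs, eps, n_iter=5000, prop_sd=0.5):
    theta = 0.0                                   # 1) initialize theta^0
    s_sim = simulator(theta)
    chain = []
    for _ in range(n_iter):
        theta_new = rng.normal(theta, prop_sd)    # 2) symmetric proposal
        s_new = simulator(theta_new)              # 3) simulate summaries
        # 4) acceptance ratio (flat prior, symmetric proposal -> only kernels)
        log_a = log_kernel(abs(s_obs - s_new), eps) \
              - log_kernel(abs(s_obs - s_sim), eps)
        if np.log(rng.uniform()) <= log_a:        # 5) accept, else keep current
            theta, s_sim = theta_new, s_new
        chain.append(theta)
    return np.array(chain)

chain = abc_mcmc(s_obs=2.0, eps=0.05)
```

On rejection the current state is re-appended, as in any Metropolis–Hastings chain; the cached $s(x^{i-1}_{sim})$ is reused rather than re-simulated, which is what step 5 above prescribes.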


I would like to find a way to start from a large value of the tolerance threshold $\epsilon$ and, based on how the MCMC performs, decrease it gradually, i.e. $\epsilon_{1}>\epsilon_{2}>\dots>\epsilon_{T}$.

There is a huge literature on this topic, but I haven't found something relevant to my case. Two papers discuss the same problem; however, both work with particles, which I do not think is appropriate in my case, since the simulation of a single $s(x_{sim})$ is really time consuming. Hence, I want something less computationally intensive.

I guess an easy (but possibly naive) approach is to decrease the tolerance threshold based on the acceptance rate or the effective sample size (ESS): when the acceptance rate or the ESS is large, we can decrease the threshold. However, I haven't found anything in the literature that applies to my case and does not work with particles.
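The naive scheme described above could look like the following sketch; the target rate and the shrink/grow factors are made-up knobs, not values from any paper:

```python
def adapt_eps(eps, acc_rate, target=0.10, shrink=0.9, grow=1.1, eps_min=1e-4):
    """Heuristic epsilon schedule: tighten while the chain still accepts
    comfortably, relax when the acceptance rate collapses.

    Would be called once per batch of iterations (e.g. every 100), with
    that batch's empirical acceptance rate.
    """
    if acc_rate > target:
        return max(eps * shrink, eps_min)   # chain mixes well -> tighten
    return eps * grow                       # too few acceptances -> relax
```

One caveat worth flagging: changing $\epsilon$ changes the stationary distribution, so samples drawn before the final $\epsilon_T$ would need to be treated as burn-in (or reweighted), which is exactly the bookkeeping that the particle-based methods handle formally.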


$\underline{\text{More technical details:}}$

In my problem the parameter $\theta$ consists of

$\bullet$ two parameters $p,\phi \in [0,1]$, with $Beta(1,1)$ prior distributions

$\bullet$ a matrix $N$ with integer elements and dimensions $K\times K$, with independent Poisson priors on its elements, i.e. $N_{ij}\sim Pois(\lambda_{ij})$.

So, in essence (assuming $K=4$), if we collapse all the parameters into a vector called $\theta$, we have

$$\theta = (p,\phi,N_{12},N_{13},N_{14},N_{23},N_{24},N_{34})$$
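Drawing $\theta$ from this prior can be sketched as follows; the Poisson rates $\lambda_{ij}$ here are placeholder values, and I assume (as the $\theta$ vector above suggests) that only the upper-triangular off-diagonal entries of $N$ are free:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
lam = np.full((K, K), 3.0)   # placeholder Poisson rates lambda_ij

def sample_prior():
    """Draw theta = (p, phi, N12, N13, N14, N23, N24, N34) from the prior."""
    p, phi = rng.beta(1.0, 1.0, size=2)        # Beta(1,1) priors on [0,1]
    iu = np.triu_indices(K, k=1)               # upper-triangular indices (i<j)
    N = rng.poisson(lam[iu])                   # N_ij ~ Pois(lambda_ij)
    return np.concatenate(([p, phi], N))

theta = sample_prior()   # length 2 + K*(K-1)/2 = 8 for K = 4
```

The mix of continuous ($p,\phi$) and integer ($N_{ij}$) components also means the MCMC proposal $p(\theta^{i}|\theta^{i-1})$ needs separate moves for the two blocks, e.g. a random walk for $p,\phi$ and integer jumps for $N_{ij}$.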

The data simulation is quite complex, but let us denote it as $L(\theta;x)$. A single run takes a few seconds, but since ABC is iteration hungry and the simulator gets called many times within the MCMC, the overall algorithm becomes computationally intensive.

Best Answer

First, I'm not sure SMC-ABC is the same as MCMC-ABC with threshold reduction; if I'm not mistaken, it is more similar to rejection-ABC with threshold reduction. So, it might not be exactly what you have in mind. You should maybe check this paper, which explains (one version of) SMC-ABC very nicely and also gives a brief comparison to the original SMC-ABC by Prof. Scott Sisson.

Second, I completely agree with Xi'an's comment. MCMC is "slow", and ABC requires many simulations. So, if your bottleneck is the simulation part, I am not sure how you can proceed, whether you reduce $\epsilon$ or not.

As Xi'an suggested, you might be able to use a surrogate to replace the actual simulator. Very simple ones are Gaussian-process regressions (check out this paper), perhaps a heteroscedastic GP or a multiple-output GP (MOGP). They are, however, quite simplistic and limited to a single mode.
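As a minimal sketch of the surrogate idea: fit a GP on a small budget of real simulator runs, mapping $\theta$ to the distance $\rho(s(x_{obs}),s(x_{sim}))$, then query the cheap GP inside the MCMC loop instead of the simulator. The toy "true" distance function and the training budget below are invented for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)

# Pretend these pairs came from 30 real (expensive) simulator runs:
# each row of `thetas` is a parameter value, `dists` the resulting distance.
thetas = rng.uniform(0.0, 5.0, size=(30, 1))
dists = np.abs(thetas[:, 0] - 2.0) + rng.normal(0.0, 0.05, size=30)  # toy truth

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(thetas, dists)

# Cheap surrogate predictions replace simulator calls inside the MCMC loop;
# the predictive std could trigger a real simulation when uncertainty is high.
mu, sd = gp.predict(np.array([[2.0], [4.5]]), return_std=True)
```

A common refinement is to use the GP's predictive uncertainty to decide when to fall back to the real simulator and refit, which keeps the surrogate honest near the posterior mode.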

There is newer literature in the machine-learning field called Simulation-Based Inference (SBI). These methods use neural density estimators (e.g., mixture density networks and various state-of-the-art normalizing flows) and do not require any threshold $\epsilon$. I have a series about it on YouTube. I see it as a natural development of the "regression adjustment" papers, which improved classical ABC algorithms by uncovering the underlying structure between parameters $\theta$ and data $x$. There is also a great Python library called "sbi".

My master's thesis in statistics (which is being submitted now) deals with reducing simulator calls in ABC. I tried two methods: one uses a neural density estimator as a surrogate for the actual simulator, which is more expressive than GP surrogates; the other uses "support points", a class of energy-distance-based representative points, instead of simple sampling. Both might help reduce simulator calls, though the NDE surrogate seems to be better, at least for low-dimensional problems.
