Approximate Bayesian Computation – Making the Tolerance Threshold $\epsilon$ Adaptive

approximate-bayesian-computation · bayesian · descriptive-statistics · markov-chain-montecarlo · particle-filter

Briefly, Approximate Bayesian Computation (ABC), instead of using the exact likelihood function $L(\theta;x)$, approximates the posterior through the observed summary statistics $s(x_{obs})$ of the data $x$.

The posterior distribution for the parameter $\theta$ can be defined in the following way

$$p_{\epsilon}(\theta|s(x_{obs}))\propto \pi(\theta) K_{\epsilon}(\rho(s(x_{obs}),s(x_{sim})))$$

$\bullet$ $s(x_{obs})$ are the observed summary statistics,

$\bullet$ $s(x_{sim})$ are the simulated summary statistics, i.e. computed from a sample $x_{sim}\sim L(\theta;x)$,

$\bullet$ $\rho$ is a distance function between the summary statistics $s(x_{obs})$ and $s(x_{sim})$, it can be for example the Euclidean distance,

$\bullet$ $K_{\epsilon}(\cdot)$ is a probability kernel, usually a normal distribution with mean equal to $0$ and variance equal to $\epsilon$. So, the closer the distance $\rho(s(x_{obs}),s(x_{sim}))$ is to $0$, the more weight the kernel $K_{\epsilon}(\cdot)$ gives to $s(x_{sim})$ (or, equivalently, to the parameter value $\theta$ that was used to simulate $s(x_{sim})$),

$\bullet$ $\epsilon$ is the tolerance threshold, i.e. it tunes how far we want our simulated data to be from the observed ones.
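To make the roles of $\rho$ and $K_{\epsilon}$ concrete, here is a minimal sketch (a Euclidean distance and a Gaussian kernel with variance $\epsilon$, matching the definitions above; the function names are mine):

```python
import numpy as np

def rho(s_obs, s_sim):
    """Euclidean distance between observed and simulated summary statistics."""
    return np.linalg.norm(np.asarray(s_obs) - np.asarray(s_sim))

def gaussian_kernel(d, eps):
    """Gaussian kernel K_eps with mean 0 and variance eps, evaluated at distance d.

    Distances near 0 get weight near the maximum; distances large relative
    to sqrt(eps) get weight near 0.
    """
    return np.exp(-0.5 * d**2 / eps) / np.sqrt(2 * np.pi * eps)

# A simulated summary close to the observed one gets more weight:
w_close = gaussian_kernel(rho([1.0, 2.0], [1.1, 2.1]), eps=1.0)
w_far = gaussian_kernel(rho([1.0, 2.0], [4.0, 5.0]), eps=1.0)
```

Note that only the ratio of kernel values matters in the MCMC acceptance step, so the normalizing constant can be dropped in practice.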


I sketch the Markov Chain Monte Carlo (MCMC) algorithm that I am using to sample from the posterior $p_{\epsilon}(\theta|s(x_{obs}))$:

$1) \ \ \text{Initialize} \ \theta^{0} \ \text{and simulate} \ s(x^{0}_{sim})\sim L(\theta^{0};x)$

$\text{For} \ i=1,\dots,N \ \text{MCMC iterations do}$

$2) \ \ \theta^{i} \sim p(\theta^{i}|\theta^{i-1})$

$3) \ \ s(x^{i}_{sim})\sim L(\theta^{i};x)$

$4) \ \ A = \frac{\pi(\theta^{i})\, K_{\epsilon}(\rho(s(x_{obs}),s(x^{i}_{sim})))\, p(\theta^{i-1}|\theta^{i})}{\pi(\theta^{i-1})\, K_{\epsilon}(\rho(s(x_{obs}),s(x^{i-1}_{sim})))\, p(\theta^{i}|\theta^{i-1})}$

$5) \ \ \text{If} \ Unif(0,1) \leq A \ \text{accept} \ \theta^{i}; \ \text{otherwise set} \ \theta^{i}=\theta^{i-1} \ \text{and} \ s(x^{i}_{sim})=s(x^{i-1}_{sim})$
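The five steps above can be sketched as follows. Everything problem-specific here is a stand-in: a toy Gaussian simulator, a flat prior, and a symmetric random-walk proposal (the latter two make the $\pi$ and $p$ terms cancel in the acceptance ratio):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta):
    """Toy stand-in for the expensive simulator L(theta; x):
    returns the summary statistic (sample mean) of simulated data."""
    return rng.normal(theta, 1.0, size=100).mean()

def log_kernel(d, eps):
    # Log of a Gaussian kernel with variance eps; constants cancel in the ratio.
    return -0.5 * d**2 / eps

def abc_mcmc(s_obs, eps, n_iter=5000, prop_sd=0.5):
    theta = 0.0                                   # 1) initialize theta^0
    s_sim = simulator(theta)
    chain = []
    for _ in range(n_iter):
        theta_new = rng.normal(theta, prop_sd)    # 2) symmetric proposal
        s_new = simulator(theta_new)              # 3) simulate summaries
        # 4) acceptance ratio (flat prior, symmetric proposal -> only kernels)
        log_a = log_kernel(abs(s_obs - s_new), eps) \
              - log_kernel(abs(s_obs - s_sim), eps)
        if np.log(rng.uniform()) <= log_a:        # 5) accept, else keep current
            theta, s_sim = theta_new, s_new
        chain.append(theta)
    return np.array(chain)

chain = abc_mcmc(s_obs=2.0, eps=0.05)
```

On rejection the current state is re-appended, as in any Metropolis–Hastings chain; the cached $s(x^{i-1}_{sim})$ is reused rather than re-simulated, which is what step 5 above prescribes.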


I would like to find a way to start from a large value of the tolerance threshold $\epsilon$ and, based on how the MCMC performs, decrease it gradually, i.e. $\epsilon_{1}>\epsilon_{2}>\dots>\epsilon_{T}$.

There is a huge literature on this topic, but I haven't found something relevant to my case. Two papers discuss the same problem; however, both work with particles, which I do not think is appropriate in my case, since the simulation of a single $s(x_{sim})$ is really time consuming. Hence, I want something less computationally intensive.

I guess an easy (but possibly naive) approach is to decrease the tolerance threshold based on the acceptance rate or the effective sample size (ESS): when the acceptance rate or the ESS is large, we can decrease the threshold. However, I haven't found anything in the literature that applies to my case and does not work with particles.
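The naive scheme described above could look like the following sketch; the target rate and the shrink/grow factors are made-up knobs, not values from any paper:

```python
def adapt_eps(eps, acc_rate, target=0.10, shrink=0.9, grow=1.1, eps_min=1e-4):
    """Heuristic epsilon schedule: tighten while the chain still accepts
    comfortably, relax when the acceptance rate collapses.

    Would be called once per batch of iterations (e.g. every 100), with
    that batch's empirical acceptance rate.
    """
    if acc_rate > target:
        return max(eps * shrink, eps_min)   # chain mixes well -> tighten
    return eps * grow                       # too few acceptances -> relax
```

One caveat worth flagging: changing $\epsilon$ changes the stationary distribution, so samples drawn before the final $\epsilon_T$ would need to be treated as burn-in (or reweighted), which is exactly the bookkeeping that the particle-based methods handle formally.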


$\underline{\text{More technical details:}}$

In my problem the parameter $\theta$ consists of

$\bullet$ two parameters $p,\phi \in [0,1]$, with $Beta(1,1)$ prior distributions

$\bullet$ a matrix $N$ with integer elements and dimensions $K\times K$, with independent Poisson priors on its elements, i.e. $N_{ij}\sim Pois(\lambda_{ij})$.

So, in essence (assuming $K=4$), if we collapse all the parameters into a vector called $\theta$, we have

$$\theta = (p,\phi,N_{12},N_{13},N_{14},N_{23},N_{24},N_{34})$$
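Drawing $\theta$ from this prior can be sketched as follows; the Poisson rates $\lambda_{ij}$ here are placeholder values, and I assume (as the $\theta$ vector above suggests) that only the upper-triangular off-diagonal entries of $N$ are free:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
lam = np.full((K, K), 3.0)   # placeholder Poisson rates lambda_ij

def sample_prior():
    """Draw theta = (p, phi, N12, N13, N14, N23, N24, N34) from the prior."""
    p, phi = rng.beta(1.0, 1.0, size=2)        # Beta(1,1) priors on [0,1]
    iu = np.triu_indices(K, k=1)               # upper-triangular indices (i<j)
    N = rng.poisson(lam[iu])                   # N_ij ~ Pois(lambda_ij)
    return np.concatenate(([p, phi], N))

theta = sample_prior()   # length 2 + K*(K-1)/2 = 8 for K = 4
```

The mix of continuous ($p,\phi$) and integer ($N_{ij}$) components also means the MCMC proposal $p(\theta^{i}|\theta^{i-1})$ needs separate moves for the two blocks, e.g. a random walk for $p,\phi$ and integer jumps for $N_{ij}$.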

The data simulation is quite complex, but let us denote it as $L(\theta;x)$. A single run takes a few seconds, but since ABC is iteration hungry and the simulator gets called many times within the MCMC, the overall algorithm becomes computationally intensive.

Best Answer

First, I'm not sure SMC-ABC is the same as MCMC-ABC with threshold reduction; if I'm not mistaken, it is more similar to rejection-ABC with threshold reduction. So, it might not be exactly what you have in mind. You should maybe check this paper, which explains (one version of) SMC-ABC very nicely and also gives a brief comparison to the original SMC-ABC by Prof. Scott Sisson.

Second, I completely agree with Xi'an's comment. MCMC is "slow", and ABC requires many simulations. So, if your bottleneck is the simulation part, I am not sure how you can proceed, whether you reduce $\epsilon$ or not.

As Xi'an suggested, you might be able to use a surrogate to replace the actual simulator. Very simple ones are Gaussian-process regressions (check out this paper), perhaps a heteroscedastic GP or a multiple-output GP (MOGP). They are, however, quite simplistic and limited to a single mode.
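As a minimal sketch of the surrogate idea: fit a GP on a small budget of real simulator runs, mapping $\theta$ to the distance $\rho(s(x_{obs}),s(x_{sim}))$, then query the cheap GP inside the MCMC loop instead of the simulator. The toy "true" distance function and the training budget below are invented for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)

# Pretend these pairs came from 30 real (expensive) simulator runs:
# each row of `thetas` is a parameter value, `dists` the resulting distance.
thetas = rng.uniform(0.0, 5.0, size=(30, 1))
dists = np.abs(thetas[:, 0] - 2.0) + rng.normal(0.0, 0.05, size=30)  # toy truth

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(thetas, dists)

# Cheap surrogate predictions replace simulator calls inside the MCMC loop;
# the predictive std could trigger a real simulation when uncertainty is high.
mu, sd = gp.predict(np.array([[2.0], [4.5]]), return_std=True)
```

A common refinement is to use the GP's predictive uncertainty to decide when to fall back to the real simulator and refit, which keeps the surrogate honest near the posterior mode.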

There is newer literature in the machine-learning field called Simulation-Based Inference (SBI). These methods use neural density estimators (e.g., mixture density networks and various state-of-the-art normalizing flows) and do not require any threshold $\epsilon$. I have a series about it on YouTube. I see it as a natural development of the "regression adjustment" papers, which improved classical ABC algorithms by uncovering the underlying structure between parameters $\theta$ and data $x$. There is also a great Python library called "sbi".

My master's thesis in statistics (which is being submitted now) deals with reducing simulator calls in ABC. I tried two methods: one uses a neural density estimator as a surrogate for the actual simulator, which is more expressive than GP surrogates; the other uses "support points", a class of energy-distance-based representative points, instead of simple sampling. Both might help reduce simulator calls, though the NDE surrogate seems to be better, at least for low-dimensional problems.
