ABC Model Selection – Selecting ABC Model from Posterior Samples

approximate-bayesian-computation, bayes-factors, bayesian, model-selection, posterior

I would like to know if there is a general scheme to do model selection based on the posterior samples from a set of ABC (Approximate Bayesian Computation) runs for a given set of models.

In particular, I was confused by this question and answer, which indicate that counting the accepted samples from re-simulations based on the posterior parameters provides an estimate of the model evidences (or Bayes factors). This seems to contradict this paper, where taking the fraction of accepted samples as the evidence appears to be valid only when sampling from the prior, not from the posterior.

The same paper mentions slightly adapted schemes for when the proposal distribution is not the prior, but they don't seem to apply to my case, as I'm not using ABC-SMC or ABC-MCMC.

So is there still a general way to get model evidences from ABC posteriors?

Besides, I'm aware of the chat here citing this paper; I will certainly try to find out whether my summary statistics are sufficient for model selection (once there is, in principle, a way to do it from posterior samples).

Any help really appreciated!


EDIT: After studying this a bit further, I may have found a solution based on the above paper (Didelot et al., 2011), which I sketch below. Still, as I'm far from being an expert on this topic, it would be awesome if people would comment on this!

This solution also requires re-simulations, but they can be done "closer" to the posterior samples than when drawing from an uninformed prior.

Let $D$ be the data, and $\theta$ the parameter vector of a given model $M$ that can simulate data $D_{\theta}$ with these parameters.

Then we have the parameter prior $p(\theta | M)$ of model $M$ and the desired model evidence given by $p(D|M)$.

Further, $\pi_{\epsilon}(D_{\theta}|D)$ is (in the simplest case) the indicator function: $1$ if the distance $d(D,D_{\theta})<\epsilon$, else $0$.

From an ABC run, we obtained $N$ posterior samples $\theta_1$, …, $\theta_N$. Following "Algorithm 5" of that paper we have

For $i=1,…,N$
(1) Generate $\theta_i^{*} \sim Q(\theta|\theta_i)$
(2) Simulate $D_{\theta_{i}^{*}}$ using $\theta_i^{*}$
(3) Compute $w_i = \frac{p(\theta_i^{*}|M) \pi_{\epsilon}(D_{\theta_{i}^{*}}|D)}{q(\theta_i^{*})}$,

where $q(\theta_i^{*}) = \frac{1}{N}\sum_{j=1}^N Q(\theta_i^{*}|\theta_j)$. Then $p(D|M) \approx {\frac{1}{N}\sum_{i=1}^N w_i}$.

Above, $Q(\theta|\theta_i)$ is a distribution that functions as a transition (or mutation/perturbation) kernel. I think the idea is to add "some noise" to the actual posterior samples $\theta_i$ to come up with proposals $\theta_i^{*}$ in each iteration. $q(\theta_i^{*})$ is then the density of the resulting proposal mixture, which is weighted against the prior to obtain the above weights $w_i$.
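To make this concrete, here is a minimal R sketch of how I understand the estimator for a scalar parameter, using a Gaussian perturbation kernel as a placeholder for $Q$. The names `simulate_data`, `distance`, and `prior_density` are hypothetical stand-ins for the model-specific pieces, and `theta` holds the $N$ ABC posterior samples; this is my reading of the scheme, not code from the paper:

evidence_abc <- function(theta, prior_density, simulate_data, distance, D, eps, sigma) {
  N <- length(theta)
  # (1) perturb each posterior sample: here Q(.|theta_i) = N(theta_i, sigma^2)
  theta_star <- rnorm(N, mean = theta, sd = sigma)
  # (2) re-simulate data and apply the indicator pi_eps
  accept <- sapply(theta_star, function(t) as.numeric(distance(D, simulate_data(t)) < eps))
  # mixture density q(theta*) = (1/N) sum_j Q(theta* | theta_j)
  q <- sapply(theta_star, function(t) mean(dnorm(t, mean = theta, sd = sigma)))
  # (3) importance weights w_i; their average estimates p(D|M)
  # (prior_density is assumed vectorised, e.g. function(x) dnorm(x, 0, sqrt(10)))
  mean(prior_density(theta_star) * accept / q)
}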

If one chooses $Q$ to be the prior itself, then $q(\theta_i^{*}) = p(\theta_i^{*}|M)$ and the above weights simplify to the standard prior- and rejection-based ABC evidence estimate (i.e. $w_i = \pi_{\epsilon}(D_{\theta_{i}^{*}}|D)$). However, using some $Q$ closer to the posterior should result in higher acceptance rates.

From that paper I understand that $Q(\theta | \theta_i)$ should be heavy-tailed. So for practical purposes I'm thinking of a non-standardised t-distribution with location $\theta_i$ and a scale comparable to the spread of the posterior samples.
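In the sketch above, this would amount to replacing the perturbation and mixture-density lines by something like the following (nu = 3 degrees of freedom and s <- sd(theta) are just my guesses for reasonable choices, not prescriptions from the paper):

s <- sd(theta); nu <- 3
theta_star <- theta + s * rt(N, df = nu)                                 # heavy-tailed t proposal around theta_i
q <- sapply(theta_star, function(t) mean(dt((t - theta)/s, df = nu)/s))  # mixture density of the t kernel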

Does this make sense to anyone?

Best Answer

An alternative, essentially free, solution is to run an ABC version of the harmonic mean evidence approximation à la Newton & Raftery (1994). Since $$\mathcal Z=1\Big/\int \dfrac{\pi(\theta|D)}{p(D|\theta)}\,\text d\theta$$ the evidence can formally be approximated by $$\hat{\mathcal Z} =1\Big/\frac{1}{N}\sum_{i=1}^N\frac{1}{p(D|\theta_i)}\qquad\theta_i\sim\pi(\theta|D)$$ and its ABC version is $$\hat{\mathcal Z} =1\Big/\frac{1}{N}\sum_{i=1}^N\frac{1}{K_\epsilon(d(D,D^\star(\theta_i)))}\qquad\theta_i\sim\pi^\epsilon(\theta|D)$$ where $K_\epsilon(\cdot)$ is the kernel used for the ABC acceptance step and $d(\cdot,\cdot)$ is the distance used to measure the discrepancy between samples.
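For concreteness, here is a minimal R sketch of this estimator, assuming `theta` holds draws from the ABC posterior and `kernel_eps`, `simulate_data`, and `distance` are placeholders for the kernel $K_\epsilon$, the simulator, and the distance (the toy code at the end of this answer is one concrete instance):

evidence_hm <- function(theta, simulate_data, distance, kernel_eps, D) {
  # evaluate the kernel at the distance between observed and re-simulated data
  k <- sapply(theta, function(t) kernel_eps(distance(D, simulate_data(t))))
  1 / mean(1/k)   # hat Z: harmonic mean of the kernel values
}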

An attempt at using this approach on a toy model showed a moderate discrepancy with the actual evidence. Using the model $$\theta\sim\mathcal N(0,10)\qquad x|\theta\sim\mathcal N(\theta,1)$$ with the observation $x^\text{o}=3$, the actual evidence is $8\times10^{-2}$, while the approximation returns a lower $2.5\times10^{-2}$ when the scale in the Cauchy kernel is chosen to produce a good fit of the ABC posterior:

[Figure: fit of the ABC posterior, under the chosen Cauchy kernel scale, to the exact posterior]

However, directly estimating the marginal density with the same Cauchy kernel produces a much closer approximation, namely $7.8\times10^{-2}$. Here is the R code I used in this experiment:

o = 3; e = rnorm(1e6, sd = sqrt(10))         # observation x^o; prior draws theta ~ N(0, 10)
k = 1 + 100*(o + rnorm(1e6) - e)^2           # inverse (unnormalised) Cauchy kernel, scale 0.1
print(c(1/mean(k[runif(1e6)*k < 1]),         # harmonic-mean ABC evidence estimate
  mean(10/k/pi), dnorm(o, sd = sqrt(11))))   # kernel density estimate at x^o; exact evidence N(0, 11)