ABC Model Selection – Selecting ABC Model from Posterior Samples

approximate-bayesian-computation, bayes-factors, bayesian, model-selection, posterior

I would like to know if there is a general scheme to do model selection based on the posterior samples from a set of ABC (Approximate Bayesian Computation) runs for a given set of models.

In particular, I was confused by this question and answer, which indicate that counting the accepted samples from re-simulations based on the posterior parameters provides an estimate of the model evidences (or Bayes factors). This seems to contradict this paper, where taking the fraction of accepted samples as the evidence appears to be valid only when sampling from the prior, not from the posterior.

The same paper mentions slightly adapted schemes for when the proposal distribution is not the prior, but they don't seem to apply to my case, as I'm not using ABC-SMC or ABC-MCMC.

So is there still a general way to get model evidences from ABC posteriors?

Besides, I'm aware of the chat here citing this paper; I will certainly try to find out whether my summary statistics are sufficient for model selection (once there is, in principle, a way to do it from posterior samples).

Any help really appreciated!


EDIT: After studying this a bit further, I may have found a solution based on the above paper (Didelot et al., 2011), which I sketch below. Still, as I'm far from being an expert on this topic, it would be awesome if people would comment on this!

This solution also requires re-simulations, but they can be done "closer" to the posterior samples than when drawing from an uninformed prior.

Let $D$ be the data, and $\theta$ the parameter vector of a given model $M$ that can simulate data $D_{\theta}$ with these parameters.

Then we have the parameter prior $p(\theta | M)$ of model $M$ and the desired model evidence given by $p(D|M)$.

Further, $\pi_{\epsilon}(D_{\theta}|D)$ is (in the simplest case) the indicator function: $1$ if the distance $d(D,D_{\theta})<\epsilon$, else $0$.

From an ABC run, we obtained $N$ posterior samples $\theta_1$, …, $\theta_N$. Following "Algorithm 5" of that paper we have

For $i=1,…,N$
(1) Generate $\theta_i^{*} \sim Q(\theta|\theta_i)$
(2) Simulate $D_{\theta_{i}^{*}}$ using $\theta_i^{*}$
(3) Compute $w_i = \frac{p(\theta_i^{*}|M) \pi_{\epsilon}(D_{\theta_{i}^{*}}|D)}{q(\theta_i^{*})}$,

where $q(\theta_i^{*}) = \frac{1}{N}\sum_{j=1}^N Q(\theta_i^{*}|\theta_j)$. Then $p(D|M) \approx {\frac{1}{N}\sum_{i=1}^N w_i}$.

Above, $Q(\theta|\theta_i)$ is a distribution that functions as a transition (or mutation/perturbation) kernel. I think the idea is to add "some noise" to the actual posterior samples $\theta_i$ to come up with proposals $\theta_i^{*}$ in each iteration. $q(\theta_i^{*})$ is then the density of the resulting proposal mixture, which is weighted against the prior to obtain the above weights $w_i$.
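To make this concrete, here is a minimal R sketch of how I understand the estimator for a scalar parameter, using a Gaussian perturbation kernel as a placeholder for $Q$. The names `simulate_data`, `distance`, and `prior_density` are hypothetical stand-ins for the model-specific pieces, and `theta` holds the $N$ ABC posterior samples; this is my reading of the scheme, not code from the paper:

evidence_abc <- function(theta, prior_density, simulate_data, distance, D, eps, sigma) {
  N <- length(theta)
  # (1) perturb each posterior sample: here Q(.|theta_i) = N(theta_i, sigma^2)
  theta_star <- rnorm(N, mean = theta, sd = sigma)
  # (2) re-simulate data and apply the indicator pi_eps
  accept <- sapply(theta_star, function(t) as.numeric(distance(D, simulate_data(t)) < eps))
  # mixture density q(theta*) = (1/N) sum_j Q(theta* | theta_j)
  q <- sapply(theta_star, function(t) mean(dnorm(t, mean = theta, sd = sigma)))
  # (3) importance weights w_i; their average estimates p(D|M)
  # (prior_density is assumed vectorised, e.g. function(x) dnorm(x, 0, sqrt(10)))
  mean(prior_density(theta_star) * accept / q)
}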

If one chooses $Q$ to be the prior itself, then $q(\theta_i^{*}) = p(\theta_i^{*}|M)$ and the above weights simplify to the standard prior- and rejection-based ABC evidence estimate (i.e. $w_i = \pi_{\epsilon}(D_{\theta_{i}^{*}}|D)$). However, using some $Q$ closer to the posterior should result in higher acceptance rates.

From that paper I understand that $Q(\theta | \theta_i)$ should be heavy-tailed. So for practical purposes I'm thinking of a non-standardised t-distribution with location $\theta_i$ and a scale comparable to the spread of the posterior samples.
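In the sketch above, this would amount to replacing the perturbation and mixture-density lines by something like the following (nu = 3 degrees of freedom and s <- sd(theta) are just my guesses for reasonable choices, not prescriptions from the paper):

s <- sd(theta); nu <- 3
theta_star <- theta + s * rt(N, df = nu)                                 # heavy-tailed t proposal around theta_i
q <- sapply(theta_star, function(t) mean(dt((t - theta)/s, df = nu)/s))  # mixture density of the t kernel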

Does this make sense to anyone?

Best Answer

An alternative, essentially free, solution is to run an ABC version of the harmonic mean evidence approximation à la Newton & Raftery (1994). Since $$\mathcal Z=1\Big/\int \dfrac{\pi(\theta|D)}{p(D|\theta)}\,\text d\theta$$ the evidence can formally be approximated by $$\hat{\mathcal Z} =1\Big/\frac{1}{N}\sum_{i=1}^N\frac{1}{p(D|\theta_i)}\qquad\theta_i\sim\pi(\theta|D)$$ and its ABC version is $$\hat{\mathcal Z} =1\Big/\frac{1}{N}\sum_{i=1}^N\frac{1}{K_\epsilon(d(D,D^\star(\theta_i)))}\qquad\theta_i\sim\pi^\epsilon(\theta|D)$$ where $K_\epsilon(\cdot)$ is the kernel used for the ABC acceptance step and $d(\cdot,\cdot)$ is the distance used to measure the discrepancy between samples.
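For concreteness, here is a minimal R sketch of this estimator, assuming `theta` holds draws from the ABC posterior and `kernel_eps`, `simulate_data`, and `distance` are placeholders for the kernel $K_\epsilon$, the simulator, and the distance (the toy code at the end of this answer is one concrete instance):

evidence_hm <- function(theta, simulate_data, distance, kernel_eps, D) {
  # evaluate the kernel at the distance between observed and re-simulated data
  k <- sapply(theta, function(t) kernel_eps(distance(D, simulate_data(t))))
  1 / mean(1/k)   # hat Z: harmonic mean of the kernel values
}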

An attempt at using this approach on a toy model showed a moderate discrepancy with the actual evidence. Using the model $$\theta\sim\mathcal N(0,10)\qquad x|\theta\sim\mathcal N(\theta,1)$$ with the observation $x^\text{o}=3$, the actual evidence is $8\times10^{-2}$, while the approximation returns a lower $2.5\times10^{-2}$ when the scale in the Cauchy kernel is chosen to produce a good fit of the ABC posterior:

[Figure: fit of the ABC posterior, under the chosen Cauchy kernel scale, to the exact posterior]

However, directly estimating the marginal density with the same Cauchy kernel produces a much closer approximation, namely $7.8\times10^{-2}$. Here is the R code I used in this experiment:

o = 3; e = rnorm(1e6, sd = sqrt(10))         # observation x^o; prior draws theta ~ N(0, 10)
k = 1 + 100*(o + rnorm(1e6) - e)^2           # inverse (unnormalised) Cauchy kernel, scale 0.1
print(c(1/mean(k[runif(1e6)*k < 1]),         # harmonic-mean ABC evidence estimate
  mean(10/k/pi), dnorm(o, sd = sqrt(11))))   # kernel density estimate at x^o; exact evidence N(0, 11)