Solved – How to calculate the Bayesian or Schwarz Information Criterion (BIC) for a multilevel bayesian model

Tags: aic, bayesian, bic, hierarchical-bayesian, multilevel-analysis

The BIC is defined (according to Wikipedia) as

$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L})$

where the likelihood is $\hat{L} = p(x|\hat{\theta},M)$, with $M$ the model, $x$ the data, and $\hat{\theta}$ the to-be-inferred parameters of the model, set at their highest-likelihood point. The question applies equally to the AIC.
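For concreteness, a minimal sketch of this formula in Python (all numbers made up):

```python
import numpy as np

# BIC = k*ln(n) - 2*ln(L_hat), with k free parameters, n observations,
# and L_hat the maximised likelihood.
def bic(log_lik_hat, k, n):
    return k * np.log(n) - 2.0 * log_lik_hat

# e.g. a model with 3 free parameters, 100 observations, and a maximised
# log-likelihood of -250.0:
print(bic(log_lik_hat=-250.0, k=3, n=100))  # ~513.82
```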

In my case, I am using MCMC to estimate the posterior distribution:

$p(\theta|x,M)\propto p(x|\theta,M)p(\theta)$
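(As an aside, a minimal random-walk Metropolis sketch of this kind of sampling; `log_lik` and `log_prior` are hypothetical stand-ins for the model's actual densities:)

```python
import numpy as np

# Random-walk Metropolis targeting p(theta|x) ∝ p(x|theta) p(theta).
def metropolis(log_lik, log_prior, theta0, n_steps=5000, step=0.1, seed=0):
    """theta0: 1-D array of starting values; returns an (n_steps, d) array."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    log_p = log_lik(theta) + log_prior(theta)
    draws = []
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)
        log_p_prop = log_lik(prop) + log_prior(prop)
        if np.log(rng.uniform()) < log_p_prop - log_p:  # accept/reject step
            theta, log_p = prop, log_p_prop
        draws.append(theta.copy())
    return np.array(draws)
```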

However, the issue in a multilevel model is that my model parameters $\theta$ are split into two blocks, $\theta=(\theta_1,\theta_2)$, such that only $\theta_1$ has a direct conditional relation with $x$:

$\hat{L} = p(x|\theta,M)=p(x|\theta_1,\theta_2,M)=p(x|\theta_1,M)$

Instead, $\theta_2$ (I believe these would be called the hyperparameters) parameterise the distribution of $\theta_1$:

$p(\theta|M)=p(\theta_1,\theta_2|M)=p(\theta_1|\theta_2,M)p(\theta_2|M)$

This factor is not contained in the likelihood function I defined; instead it sits in what you might call the prior $p(\theta)$, the way I wrote it above. Since the likelihood function $p(x|\theta,M)$ then doesn't contain what is probably the crucial part of my Bayesian model, I feel like there must be something wrong.

So I have two questions:

1) I feel like I need to include the "forward model" for $\theta_1|\theta_2,M$ in the calculation of the BIC. Does that mean I should define the likelihood in the BIC as $\hat{L}=p(x|\hat{\theta}_1,M)p(\hat{\theta}_1|\hat{\theta}_2,M)$ instead of just $p(x|\hat{\theta}_1,M)$? And should it be evaluated at the highest-likelihood point of $\theta_1$, or should $\theta_1$ be marginalised out:

$\hat{L}=p(x|\hat{\theta}_2,M)=\int_{\theta_1} p(x|\theta_1,M)\,p(\theta_1|\hat{\theta}_2,M)\,d\theta_1$

To me, the latter seems like the most logical solution. Technically, I could have removed the $\theta_1$ "level" of the hierarchical model by directly incorporating the above integral into a likelihood $p(x|\theta_2,M)$ without changing the results for $\theta_2$, turning it into a two-level model.
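If the integral has no closed form, one crude way to approximate it is simple Monte Carlo: draw $\theta_1$ from $p(\theta_1|\hat{\theta}_2,M)$ and average the likelihoods. A minimal sketch with a made-up Gaussian hierarchy (all distributions and values are illustrative, not from the model above):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# Toy Gaussian hierarchy (all names and values hypothetical):
#   theta_2 = (mu, tau):  theta_1 ~ N(mu, tau^2)
#   data:                 x_i | theta_1 ~ N(theta_1, sigma^2), sigma known
rng = np.random.default_rng(0)
x = rng.normal(1.0, 0.5, size=20)              # fake data
mu_hat, tau_hat, sigma = 1.0, 0.8, 0.5         # pretend estimates of theta_2

# Simple Monte Carlo estimate of
#   log p(x|theta_2_hat) = log ∫ p(x|theta_1) p(theta_1|theta_2_hat) d(theta_1)
S = 10_000
theta1 = rng.normal(mu_hat, tau_hat, size=S)   # draws from p(theta_1|theta_2_hat)
log_lik = norm.logpdf(x[:, None], loc=theta1, scale=sigma).sum(axis=0)
log_marg = logsumexp(log_lik) - np.log(S)      # log of the mean of p(x|theta_1)
print(log_marg)
```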


2) What exactly is "the highest likelihood point" in a Bayesian model? Is it the point $(\theta_1,\theta_2)$ with the largest likelihood value $p(x|\theta_1,\theta_2,M)$ (or however we choose to define the likelihood, see above), or the point with the highest posterior probability $p(\theta_1,\theta_2|x,M)$, which will be similar but also includes the priors?


BONUS QUESTION: Are $\theta_2$ really called the hyperparameters, or are they just model parameters, with the hyperparameters being the ones that parameterise the prior $p(\theta_2|M)$?

Best Answer

You may want to have a look at Section 8.4.2 of Murphy's book 'Machine Learning: A Probabilistic Perspective', where the BIC is nicely derived from the marginal likelihood. A flat prior is assumed there; however, under some specific assumptions the presented method (the Laplace approximation) may be applied to a non-uniform prior as well, which results in the likelihood in the BIC being replaced by the posterior evaluated at the MAP estimate (similar to what you have written in question 1, but without the integration). I have a scanned, self-written PDF about this that I may share (unfortunately it's half English, half German, as I had intended it for private use only).

The BIC's rooting in the marginal likelihood also makes it clear that you were on the right track regarding question 1: if you can analytically compute the integral over $\theta_1$, then you do not have to resort to Laplace's approximation for $\theta_1$, and you will thus end up with a better approximation of the marginal likelihood (which is what you are really interested in; the BIC is just a substitute). The Laplace approximation may then be applied only to the resulting expression, i.e. $p(x|\theta_2,M)$. In other words, your approach is correct: you may swap the $\hat{L}$ of the original BIC for $\hat{L}=p(x|\theta^*_2,M)$, with $\theta_2$ set at its MAP estimate (see question 2 below).
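To make the link between the Laplace approximation and the BIC concrete, here is a small sketch in a conjugate Gaussian toy model (all values made up; in this model the Laplace approximation of the marginal likelihood happens to be exact), comparing it with $-\mathrm{BIC}/2$:

```python
import numpy as np
from scipy.stats import norm

# Toy model:  x_i | theta ~ N(theta, sigma^2),  theta ~ N(0, tau^2),
# with sigma and tau known, so there is k = 1 free parameter.
rng = np.random.default_rng(1)
sigma, tau = 1.0, 2.0
x = rng.normal(0.7, sigma, size=50)
n, k = len(x), 1

# MAP estimate and Hessian of -log p(x, theta) at the MAP (d = 1 parameter)
h = n / sigma**2 + 1 / tau**2
theta_map = (x.sum() / sigma**2) / h

# Laplace: log p(x|M) ≈ log p(x|theta*) + log p(theta*) + (d/2) ln 2π - (1/2) ln|H|
log_joint = norm.logpdf(x, theta_map, sigma).sum() + norm.logpdf(theta_map, 0, tau)
log_marg = log_joint + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(h)

# Keeping only the terms that grow with n recovers -BIC/2 at the MLE:
bic = k * np.log(n) - 2 * norm.logpdf(x, x.mean(), sigma).sum()
print(log_marg, -bic / 2)   # close but not identical: the BIC drops O(1) terms
```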

Now, regarding question 2: since Laplace's method is based on the maximum exponent of the integrand in question (which is the joint probability of $x,\theta_1,\theta_2$ in the case of the marginal likelihood), you must use the MAP estimate as your "highest likelihood point", i.e. the point $$(\theta^*_1,\theta^*_2) = \operatorname*{argmax}_{\theta_1,\theta_2}\; p(\theta_1,\theta_2|x,M)$$
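Since you already have MCMC output, one practical way to approximate this MAP point is to take the sampled draw with the highest recorded log posterior and, optionally, polish it with a local optimiser. A sketch under that assumption (`draws` and `log_post_fn` are hypothetical stand-ins for whatever your sampler stores, i.e. stacked $(\theta_1,\theta_2)$ draws and the unnormalised $\log p(\theta_1,\theta_2|x,M)$):

```python
import numpy as np
from scipy.optimize import minimize

def approximate_map(draws, log_post_fn):
    """draws: (S, d) array of posterior samples; log_post_fn: theta -> float."""
    log_post = np.array([log_post_fn(t) for t in draws])
    best = draws[np.argmax(log_post)]                  # best sampled point
    res = minimize(lambda t: -log_post_fn(t), best)    # local refinement
    return res.x if res.success else best
```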

Hope this helps...
