Bayesian – What is the BIC Prior for Bayesian Linear Regression?

Tags: bayesian, bic, r, regression

Is there a prior probability distribution associated with BIC (Bayesian Information Criterion)?

I ask this because in the R package BAS there is a linear modeling function that can be called as follows:

bas.lm(y ~ x, data = my_dataframe, prior = 'BIC')

and the documentation is not clear on how the priors are set given the 'BIC' argument.

The posterior distributions come out looking normal (actually Student-t, I think). For example, they could look something like this:
[Figure: posterior density plots of the regression coefficients]
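For reference, here is a minimal, self-contained sketch (not from the original post; the simulated data and variable names are made up) of how such plots might be produced, assuming the coef() and plot() methods that BAS supplies for bas.lm fits:

library(BAS)

set.seed(1)
my_dataframe <- data.frame(x = rnorm(100))
my_dataframe$y <- 1 + 2 * my_dataframe$x + rnorm(100)   # toy data for illustration

fit <- bas.lm(y ~ x, data = my_dataframe, prior = 'BIC')
beta <- coef(fit)   # posterior summaries of the coefficients
beta                # posterior means, SDs and inclusion probabilities
plot(beta)          # marginal posterior densities, similar to the figure above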

In summary, I am wondering: what prior distributions are being used for the parameters?

Best Answer

I cannot offer you a definitive answer for that particular package, but in general the Bayesian Information Criterion is a large-sample approximation to the Bayes factor. Because the likelihood dominates the prior in large samples, the effect of the prior becomes negligible.

Here is a somewhat heuristic derivation.

We can obtain a large-sample normal approximation to the log-likelihood $l(\theta|y)$ in the multiparameter case. A multivariate Taylor expansion about the maximum likelihood estimate $\hat\theta$ gives \begin{eqnarray} l(\theta|y)&\approx&l(\hat\theta|y)+(\theta-\hat\theta)'\frac{\partial l(\hat\theta|y)}{\partial\theta}-\frac{n}{2}(\theta-\hat\theta)'V^{-1}(\theta-\hat\theta)\notag\\ &=& l(\hat\theta|y)-\frac{n}{2}(\theta-\hat\theta)'V^{-1}(\theta-\hat\theta), \end{eqnarray} where the first-order term vanishes because the score is zero at the MLE $\hat\theta$, and $$ V=\left(-\frac{1}{n}\sum_j\frac{\partial^2 l(\hat\theta|y_j)}{\partial\theta\partial\theta'}\right)^{-1}=\left(-\frac{1}{n}\frac{\partial^2 l(\hat\theta|y)}{\partial\theta\partial\theta'}\right)^{-1}. $$ Let $$m_i(y)=\int f_{i}\left(y|\theta_{i},M_i\right)\pi_{i}\left(\theta_{i}|M_i\right)d\theta_{i}$$ denote the marginal likelihood of model $M_i$.
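To see the quadratic approximation in action, here is a small numerical illustration (my own, not part of the original answer) using a one-parameter exponential model with rate $\theta$, where the MLE is $\hat\theta=1/\bar y$ and $V=\hat\theta^2$:

set.seed(42)
n <- 200
y <- rexp(n, rate = 2)
loglik <- function(theta) sum(dexp(y, rate = theta, log = TRUE))

theta_hat <- 1 / mean(y)          # MLE of the rate
V <- theta_hat^2                  # (-(1/n) * d^2 l / d theta^2)^{-1} at the MLE
quad <- function(theta) loglik(theta_hat) - (n / 2) * (theta - theta_hat)^2 / V

theta_grid <- seq(1.5, 2.5, length.out = 5)
round(cbind(theta     = theta_grid,
            exact     = sapply(theta_grid, loglik),
            quadratic = sapply(theta_grid, quad)), 2)

The exact and quadratic columns agree closely near $\hat\theta$, which is all the derivation below requires.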

Substituting the exponential of the approximate expression for $l(\theta|y)$ gives \begin{eqnarray*} m_i(y)&\approx&L(\hat\theta_i|y)\int\exp\left(-\frac{n}{2}(\theta_i-\hat\theta_i)'V_i^{-1}(\theta_i-\hat\theta_i)\right)\pi_i(\theta_i|M_i)d\theta_i\\ &\approx&L(\hat\theta_i|y)\pi_i(\hat\theta_i|M_i)\int\exp\left(-\frac{n}{2}(\theta_i-\hat\theta_i)'V_i^{-1}(\theta_i-\hat\theta_i)\right)d\theta_i, \end{eqnarray*} where the second approximation holds because, for large $n$, the $\exp$ term concentrates around $\hat\theta_i$, so the prior is effectively constant at $\pi_i(\hat\theta_i|M_i)$ over the region that contributes to the integral (see the Laplace approximation for a rigorous argument).

As the integrand now is a kernel of a multivariate normal distribution with covariance matrix $V_i/n$, carrying out the integration gives \begin{eqnarray*} m_i(y)&\approx&L(\hat\theta_i|y)\pi_i(\hat\theta_i|M_i)(2\pi)^{d_i/2}|n^{-1}V_i|^{1/2}\\ &=&L(\hat\theta_i|y)\pi_i(\hat\theta_i|M_i)(2\pi)^{d_i/2}n^{-d_i/2}|V_i|^{1/2}, \end{eqnarray*} where $d_i$ is the number of parameters in $\theta_i$.
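Continuing the exponential toy example from above (again my own illustration, with an Exp(1) prior on the rate chosen purely for concreteness), one can check the Laplace-approximated marginal likelihood against a numerically integrated one:

# log marginal likelihood by numerical integration, stabilised around the MLE;
# the integration range comfortably covers the region where the integrand has mass
prior <- function(theta) dexp(theta, rate = 1)
log_m_exact <- loglik(theta_hat) +
  log(integrate(function(t) exp(sapply(t, loglik) - loglik(theta_hat)) * prior(t),
                lower = 0.5, upper = 5)$value)

# Laplace approximation: L(theta_hat) * pi(theta_hat) * (2*pi)^(d/2) * n^(-d/2) * |V|^(1/2), with d = 1
log_m_laplace <- loglik(theta_hat) + log(prior(theta_hat)) +
  0.5 * log(2 * pi) - 0.5 * log(n) + 0.5 * log(V)

c(exact = log_m_exact, laplace = log_m_laplace)   # should agree closely for n = 200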

The log-Bayes factor for models 1 and 2 can now be approximated by \begin{eqnarray*} \log(B_{12})&\approx&\left[\log\left(\frac{L_1(\hat\theta_1|y)}{L_2(\hat\theta_2|y)}\right)-\frac{d_1-d_2}{2}\log(n)\right]\\ &&+\left[\log\left(\frac{\pi_1(\hat\theta_1|M_1)}{\pi_2(\hat\theta_2|M_2)}\right)+\frac{1}{2}\log\frac{|V_1|}{|V_2|}+\frac{d_1-d_2}{2}\log(2\pi)\right] \end{eqnarray*} The term in the second square brackets does not grow with $n$ and can be neglected in large samples. In the first bracket, the first term is the log-likelihood ratio, which is large if the data favors $M_1$, and the second term penalizes the model with more parameters. Since $\mathrm{BIC}_i=-2\,l(\hat\theta_i|y)+d_i\log(n)$, the first bracket equals $-\frac{1}{2}(\mathrm{BIC}_1-\mathrm{BIC}_2)$, so the BIC difference approximates $-2\log(B_{12})$ without any prior ever being specified explicitly.
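In practice this means the approximate log-Bayes factor can be read directly off the BIC values of two fitted models. A quick, self-contained sketch (illustrative only, with made-up data) using base R's lm() and BIC():

# Approximate log Bayes factor for M1 (y ~ x1) vs M2 (y ~ x1 + x2)
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)          # x2 is irrelevant by construction

m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x1 + x2, data = d)

log_B12 <- -(BIC(m1) - BIC(m2)) / 2       # log(B_12) ~ -(BIC_1 - BIC_2) / 2
log_B12                                   # positive: the data favors the smaller model M1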
