Solved – Analogous measure of AIC which uses the posterior distribution for model selection

aicbayesianmarkov-chain-montecarlomodel selection

Suppose the following problem: I have $n$ models, $M_k$, each with parameters $\mathbf{\theta}_k$ for a data set $D$. There where previous observations of a subset of the parameters which are common to every model $M_k$ (i.e., I have well defined priors for a subset of the parameters $\theta_k$), so I performed an MCMC algorithm in order to obtain the posterior distribution of each model using that prior information, i.e., I have $p(\theta_k|D,M_k)$, and have to decide which of those models is the 'correct' one.

I was thinking in defining what do I mean by 'the correct' one, and came up with the idea that I have to decide which of the posterior distributions is closer to the 'real' posterior distribution that generated the data (which may or may not be in my set of posterior distributions). I was thinking of using bayes factors, but I keep thinking that I need something like the AIC which, instead of using the likelihood and the corresponding MLE estimates, uses the posterior distributions and the corresponding maximum-a-posteriori estimates. My idea is to obtain an unbiased (or nearly unbiased) estimator of the KL divergence between the real posterior and my posteriors (understanding that the AIC is an estimator of the KL divergence between the 'real' likelihood and the likelihood of my models).

Is there something like this in the statistical literature? I'm just kind of crazy of thinking the problem like this?

Best Answer

None of these information criteria are unbiased, but under some conditions they are consistent estimators of the out-of-sample deviance. They also all utilize the likelihood in some fashion, but the WAIC and the LOOIC differ from the AIC and the DIC in that the former two average the likelihood for each observation over (draws from) the posterior distribution, whereas the latter two plug in point estimates. In this sense, the WAIC and LOOIC are preferable because they do not make an assumption that the posterior distribution is multivariate normal, with the LOOIC being somewhat preferable to the WAIC because it can be made more robust to outliers and has a diagnostic that can be evaluated to see if its assumptions are met.

Overview article

More detail about the practicalities

R package

Related Solutions

Solved – What would be the reason that the posterior distribution looks like the prior using MCMC

Assuming your likelihood and MCMC work correctly, which we can't know of course, and assuming your new variables are not ineffective:

If you increase the number of parameters under calibration with data fixed, an increase in marginal posterior uncertainty up to the point that the marginal posterior looks like the prior is quite common.
This happens if your data is not any more informative enough to constrain all parameters at once to a particular value (i.e. you have trade-offs in parameter space = equifinality)
A way to check this is to look at the correlation between parameters in the posterior - typically, you will see that while the marginal posteriors look flat or like the prior, the the multivariate posterior space is smaller than the multivariate prior space (i.e. you see correlations between the parameters that were not in the prior).
A further check would be to compare the prior predictive distribution to the posterior predictive distribution.
I'm sure there are better references, but an example of this phenomenon is in the appendix of 1, where we decrease the information in the data, and you see how marginal posteriors and correlations increase. For a moderate decrease of information in the data, you can still see the reduction of uncertainty in the posterior pair correlations, but if you further increase the information in your data, you get higher-order correlations and all looks random (it likely isn't though).

1 Hartig, F.; Dislich, C.; Wiegand, T. & Huth, A. (2014) Technical Note: Approximate Bayesian parameterization of a process-based tropical forest model. Biogeosciences, 11, 1261-1272. http://www.biogeosciences.net/11/1261/2014/bg-11-1261-2014.html

Addition: I thought for further reference, an illustration would be useful. What are are talking about is this type of correlation in posterior space:

library(psych)
par1= runif(1000,0,1)
par2 =par1 + rnorm(1000,sd = 0.05)
scatter.hist(par1,par2)

As one can see, the marginals look flat, but there is actually a large reduction in uncertainty in the two-dimensional space.

Solved – AIC and BIC criterion for Model selection, how is it used in this paper

In my answer here I show that in a case like the present one, in which we test nested models against each other, the minimum AIC rule selects the larger model (i.e., rejects the null) if the likelihood ratio statistic $$ \mathcal{LR}=n[\log(\widehat{\sigma}^2_1)-\log(\widehat{\sigma}^2_2)], $$ with $\widehat{\sigma}^2_i$ the ML error variance estimates of the restricted and unrestricted models, exceeds $2K_2$. Here, $K_2$ is the number of additional variables in the larger model. In your case, $K_2=1$, corresponding to $x_{i2}$. Thus, select the larger model if $\mathcal{LR}>2$.

Now, in the present linear regression framework, the absolute value of the $t$-statistic $$|t|=\left| \dfrac{\sqrt{n}\hat\beta(U) }{\sigma_\beta} \right|$$ is simply the positive square root of the LR-statistic.

(Actually, this in general only holds asymptotically, as we have $t^2=F$, the $F$- or Wald-statistic, which is in general not numerically identical to $\cal{LR}$ in finite samples. Leeb and Pötscher however assume that $\sigma^2$ is known, which, as is shown here, restores exact numerical equivalence of Wald, LR and score statistics in this setup.)

Hence, going with the larger model according to the mininum AIC rule when $\mathcal{LR}>2=c$ corresponds to rejecting when the t-statistic exceeds $\sqrt{c}$.

It is worth pointing out that this implies that, in this case, the AIC rule is nothing but a hypothesis test at level $\alpha=0.157$, as (the LR statistic being $\chi^2_1$ under the present $H_0$ of the smaller model being the correct one)

> 1-pchisq(2,df = 1)
[1] 0.1572992

> 2*pnorm(-sqrt(2))
[1] 0.1572992

Solving the equation $1.96=\sqrt{\ln n}$ for $n$ gives that BIC would be of the same size as a test at the 5%-level at $n\approx46$.

It does not seem to be a general result that AIC corresponds to a liberal nested hypothesis test. For example, when $K_2=8$, AIC is equivalent to rejecting when $\mathcal{LR}>16$, which, under the null, has probability

> 1-pchisq(2*8,df = 8)
[1] 0.04238011

In fact, the probability tends to zero with $K_2$:

Best Answer

Related Solutions

Solved – What would be the reason that the posterior distribution looks like the prior using MCMC

Solved – AIC and BIC criterion for Model selection, how is it used in this paper

Related Question