Does BIC try to find a true model

aic, bic, model-selection

This question is a follow-up, and an attempt to clear up possible confusion, on a topic that I and many others find a bit difficult: the difference between AIC and BIC. In a very nice answer by @Dave Kellen on this topic (https://stats.stackexchange.com/a/767/30589), we read:

Your question implies that AIC and BIC try to answer the same
question, which is not true. AIC tries to select the model that most
adequately describes an unknown, high dimensional reality. This means
that reality is never in the set of candidate models that are being
considered. On the contrary, BIC tries to find the TRUE model among
the set of candidates. I find it quite odd the assumption that reality
is instantiated in one of the models that the researchers built along
the way. This is a real issue for BIC.

In a comment below, by @gui11aume , we read:

(-1) Great explanation, but I would like to challenge an assertion.
@Dave Kellen Could you please give a reference to where the idea that
the TRUE model has to be in the set for BIC? I would like to
investigate on this, since in this book the authors give a convincing
proof that this is not the case. – gui11aume May 27 '12 at 21:47

It seems that this assertion comes from Schwarz himself (1978), although the assumption was not necessary. The authors of the book that @gui11aume links to write in their article "Multimodel inference: Understanding AIC and BIC in Model selection" (Burnham and Anderson, 2004):

Does the derivation of BIC assume the existence of a true model, or,
more narrowly, is the true model assumed to be in the model set when
using BIC? (Schwarz's derivation specified these conditions.) … The
answer … no. That is, BIC (as the basis for an approximation to a
certain Bayesian integral) can be derived without assuming that the
model underlying the derivation is true (see, e.g. Cavanaugh and Neath
1999; Burnham and Anderson 2002:293-5). Certainly, in applying BIC,
the model set need not contain the (nonexistent) true model
representing full reality. Moreover, the convergence in probability of
the BIC-selected model to a target model (under the idealization of
an iid sample) does not logically mean that that target model must be
the true data-generating distribution.

So, I think it is worth a discussion or some clarification (if more is needed) on this subject. Right now, all we have is a comment from @gui11aume (thank you!) under a very highly voted answer regarding the difference between AIC and BIC.

Best Answer

The information criterion by Schwarz (1978), the SIC (also known as BIC), was designed so that it asymptotically chooses the model with the higher posterior odds, i.e., under equal prior model probabilities, the model with the higher marginal likelihood given the data. So roughly $$ \frac{p(M_1|y)}{p(M_2|y)} > 1 \overset{A}{\sim} SIC(M_1) < SIC(M_2) $$ where $\overset{A}{\sim}$ denotes "asymptotically equivalent" and $p(M_j|y)$ is the posterior probability of model $j$ given data $y$. I do not see how this result would depend on model 1 being true (is there even a true model in a Bayesian framework?).
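To make the link between posterior odds and the SIC a bit more concrete, here is a rough sketch rather than a full derivation. It assumes the usual regularity conditions for a Laplace approximation of the marginal likelihood, equal prior model probabilities, and the per-observation scaling $SIC(M_j) = -\frac{2}{T}\ell(\hat{\theta}_j;y) + k_j\frac{\ln T}{T}$, where $\ell(\hat{\theta}_j;y)$ is the maximized log likelihood of model $j$, $k_j$ its number of parameters and $T$ the sample size:

$$ \ln p(y|M_j) = \ell(\hat{\theta}_j;y) - \frac{k_j}{2}\ln T + O(1) = -\frac{T}{2}\, SIC(M_j) + O(1) $$

$$ \ln \frac{p(M_1|y)}{p(M_2|y)} = \ln \frac{p(y|M_1)}{p(y|M_2)} = -\frac{T}{2}\left[ SIC(M_1) - SIC(M_2) \right] + O(1) $$

The remainder stays bounded as $T$ grows, so once $T\,[SIC(M_1) - SIC(M_2)]$ diverges, the sign of the log posterior odds agrees with the SIC ranking. In line with the Burnham and Anderson passage quoted in the question, nothing in this approximation requires either model to be the data-generating process.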

What I think is responsible for the confusion is that the SIC has another nice feature: under certain conditions, it asymptotically selects the "true" model if that model is within the model universe. Both AIC and SIC are special cases of the criterion $$ IC(k) = -\frac{2}{T} \ell(\hat{\theta};y) + k g(T) $$ where $\ell(\hat{\theta};y)$ is the log likelihood evaluated at the parameter estimates $\hat{\theta}$, $k$ is the number of parameters and $T$ is the sample size. When the model universe consists of linear, Gaussian models, it can be shown that we need $$ g(T) \to 0 \; \text{as} \; T \to \infty $$ so that, with probability approaching one, the IC does not select a model that is smaller than the true model, and $$ Tg(T) \to \infty \; \text{as} \; T \to \infty $$ so that, with probability approaching one, the IC does not select a model that is larger than the true model. We have that $$ g_{AIC}(T) = \frac{2}{T},\;\; g_{SIC}(T) = \frac{\ln{T}}{T} $$ so $Tg_{AIC}(T) = 2$ stays bounded, while $Tg_{SIC}(T) = \ln T \to \infty$. Hence SIC fulfills both conditions, while AIC fulfills the first but not the second. For a very accessible exposition of these features and a discussion of practical implications, see Chapter 6 of Elliott and Timmermann (2016).
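The two conditions on $g(T)$ can be illustrated with a small Monte Carlo experiment. The following is a minimal sketch, not taken from the cited sources: the nested regression design, the coefficient values, the function names (`select_order`, `hit_rates`) and the number of replications are all assumptions chosen for illustration. It simulates a linear Gaussian model whose true order is 2 out of 6 nested candidates and records how often each penalty picks the true order.

```python
import numpy as np


def g_aic(T):
    """AIC penalty per parameter: g(T) = 2 / T."""
    return 2.0 / T


def g_sic(T):
    """SIC/BIC penalty per parameter: g(T) = ln(T) / T."""
    return np.log(T) / T


def select_order(y, X, g):
    """Return the k that minimises IC(k) = -(2/T) * loglik + k * g(T).

    Candidate k uses the first k columns of X (nested models). The error
    variance is common to all candidates, so it is left out of k.
    """
    T, K = X.shape
    ics = []
    for k in range(1, K + 1):
        beta, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
        resid = y - X[:, :k] @ beta
        sigma2 = resid @ resid / T  # ML estimate of the error variance
        # Gaussian log likelihood evaluated at the ML estimates
        loglik = -0.5 * T * (np.log(2.0 * np.pi) + np.log(sigma2) + 1.0)
        ics.append(-2.0 / T * loglik + k * g(T))
    return int(np.argmin(ics)) + 1


def hit_rates(T, k_true=2, K=6, reps=200, seed=0):
    """Share of replications in which each criterion picks k_true."""
    rng = np.random.default_rng(seed)
    hits = {"AIC": 0, "SIC": 0}
    for _ in range(reps):
        X = rng.standard_normal((T, K))
        # Assumed design: the first k_true coefficients are 1, the rest 0.
        y = X[:, :k_true].sum(axis=1) + rng.standard_normal(T)
        hits["AIC"] += select_order(y, X, g_aic) == k_true
        hits["SIC"] += select_order(y, X, g_sic) == k_true
    return {name: count / reps for name, count in hits.items()}


if __name__ == "__main__":
    for T in (50, 500, 5000):
        print(T, hit_rates(T))
```

If the conditions above do their job, the SIC hit rate should move toward one as $T$ grows, whereas the AIC hit rate should level off below one, because AIC keeps choosing overly large models with positive probability. Note that the experiment puts the data-generating model inside the candidate set on purpose: that is the setting of the consistency result, not a requirement for using the SIC.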

Elliott, G. and A. Timmermann (2016, April). Economic Forecasting. Princeton University Press.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461-464.