Solved – Paradox in model selection (AIC, BIC, to explain or to predict?)

aic, bic, model-selection, paradox, regression

Having read Galit Shmueli's "To Explain or to Predict" (2010) and some literature on model selection using AIC and BIC, I am puzzled by an apparent contradiction. There are three premises:

  1. AIC- versus BIC-based model choice (end of p. 300 – start of p. 301): simply put, AIC should be used for selecting a model intended for prediction while BIC should be used for selecting a model for explanation. Additionally (not in the above paper), we know that under some conditions BIC selects the true model among the set of candidate models; the true model is what we seek in explanatory modelling (end of p. 293).
  2. Simple arithmetic: for samples of size 8 or larger (so that $\ln(n)>2$), AIC will select a model at least as large as, and often larger than, the one BIC selects, due to the different complexity penalties in AIC versus BIC (the penalties are spelled out just after this list).
  3. The true model (i.e. the model with the correct regressors and the correct functional form but imperfectly estimated coefficients) may not be the best model for prediction (p. 307): a regression model with a missing predictor may be a better forecasting model, because the bias introduced by omitting the predictor may be outweighed by the reduction in variance due to estimation imprecision.
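
To make premise 2 concrete, here are the standard definitions, where $\hat{L}$ is the maximised likelihood and $k$ the number of estimated parameters:

$$\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln(n) - 2\ln\hat{L}.$$

The fit term is shared, so the two criteria differ only in the per-parameter penalty: $2$ for AIC versus $\ln(n)$ for BIC. Since $\ln 8 \approx 2.08 > 2$, BIC penalises each additional parameter more heavily than AIC once $n \ge 8$, and so it can never select a larger model than AIC from the same candidate set.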

Points 1. and 2. suggest that larger-than-true models may be better for prediction than more parsimonious models. Meanwhile, point 3. gives an opposite example where a more parsimonious model is better for prediction than a larger, true model. I find this puzzling.

Questions:

  1. How can the apparent contradiction between points 1 and 2 on the one hand and point 3 on the other be explained or resolved?
  2. In light of point 3., could you give an intuitive explanation for why and how a larger model selected by AIC is actually better for prediction than a more parsimonious model selected by BIC?

I am not saying there is a contradiction in Shmueli (2010), I am just trying to understand an apparent paradox.

Best Answer

I will try to explain what is going on using some material that I refer to below, together with what I have learned from personal correspondence with its author.

http://homepages.cwi.nl/~pdg/presentations/RSShandout.pdf

The handout above works through an example of inferring a third-degree polynomial plus noise. If you look at the bottom-left quadrant, you will see that on a cumulative basis AIC beats BIC over a horizon of 1000 samples. However, you can also see that up to about sample 100 the instantaneous risk of AIC is worse than that of BIC. This is because AIC is a poor estimator in small samples (a suggested fix is AICc). The 0–100 region is the one the "To Explain or To Predict" paper demonstrates, without a clear explanation of what is going on.

Also, even though it is not obvious from the picture, once the number of samples becomes large (the slopes become almost identical) the instantaneous risk of BIC outperforms that of AIC, because the true model is in the search space. By that point, however, the ML estimates are so concentrated around their true values that the overfitting of AIC becomes irrelevant: the extra model parameters are very close to 0. As you can see from the top-right quadrant, AIC identifies on average a polynomial degree of about 3.2 (over many simulation runs it sometimes identifies a degree of 3, sometimes 4). But that extra parameter is minuscule, which makes AIC a no-brainer against BIC.
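
Below is a minimal sketch, in Python, of the kind of experiment described above. It is not Prof. Grunwald's code: the cubic coefficients, noise level, candidate degrees, number of replicates, and sample sizes are assumptions chosen only to illustrate how AIC and BIC trade off the selected degree against prediction risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_poly_ll(x, y, degree):
    """OLS polynomial fit; return coefficients and Gaussian log-likelihood."""
    X = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    n = len(y)
    sigma2 = np.mean(resid**2)                      # ML estimate of noise variance
    ll = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return coef, ll

def select_degree(x, y, criterion, max_degree=10):
    """Pick the polynomial degree minimising AIC or BIC."""
    n = len(y)
    scores = []
    for d in range(max_degree + 1):
        _, ll = fit_poly_ll(x, y, d)
        k = d + 2                                   # coefficients + noise variance
        penalty = 2 * k if criterion == "aic" else k * np.log(n)
        scores.append(penalty - 2 * ll)
    return int(np.argmin(scores))

# Assumed cubic truth plus unit-variance Gaussian noise (illustrative values only).
true_coef = np.array([2.0, -1.0, 0.5, 1.0])         # degree-3 polynomial
def truth(x):
    return np.polyval(true_coef, x)

for n in (25, 100, 1000):
    picks = {"aic": [], "bic": []}
    risks = {"aic": [], "bic": []}
    for _ in range(200):                            # simulation replicates
        x = rng.uniform(-1, 1, n)
        y = truth(x) + rng.normal(0, 1, n)
        x_test = rng.uniform(-1, 1, 2000)
        for crit in ("aic", "bic"):
            d = select_degree(x, y, crit)
            coef, _ = fit_poly_ll(x, y, d)
            pred = np.vander(x_test, d + 1) @ coef
            picks[crit].append(d)
            risks[crit].append(np.mean((pred - truth(x_test))**2))
    for crit in ("aic", "bic"):
        print(f"n={n:5d} {crit.upper()}: mean degree {np.mean(picks[crit]):.2f}, "
              f"mean squared prediction risk {np.mean(risks[crit]):.4f}")
```

Running it for a few sample sizes should let you see both effects described above: the penalty AIC tends to pay in small samples, and the vanishing cost of its occasional extra degree once $n$ is large.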

The story is not that simple, however. There are several confusions in papers treating AIC and BIC. Two scenarios need to be considered:

1) The model that is searched for is static/fixed, and we increase the number of samples and see what happens under different methodologies.

a) The true model is in the search space. We covered this case above.

b) The true model is not in the search space but can be approximated with the functional form we are using. In this case AIC is also superior.

http://homepages.cwi.nl/~pdg/presentations/RSShandout.pdf (page 9)

c) The true model is not in the search space and we are not even close to getting it right with an approximation. According to Prof. Grunwald, we don't know what's going on under this scenario.

2) The number of samples is fixed, and we vary the model to be searched for, to understand the effects of model difficulty under different methodologies.

Prof. Grunwald provides the following example. Suppose the truth is a distribution with a parameter $\theta = \sqrt{(\log n)/n}$, where $n$ is the sample size. Candidate model 1 fixes $\theta = 0$, while candidate model 2 is the same distribution with $\theta$ left as a free parameter. BIC always selects model 1, yet model 2 always predicts better, because the ML estimate of $\theta$ is closer to the true value than 0 is. So BIC fails to find the truth and predicts worse at the same time.
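
Here is a small sketch of one way to instantiate that example. I am assuming a Gaussian location family with known unit variance and a multiplicative constant of 0.5 in front of $\sqrt{(\log n)/n}$; Prof. Grunwald's actual construction may differ, and in this assumed version BIC keeps $\theta = 0$ most of the time rather than always, but the qualitative point is the same: BIC drops the parameter while the free-parameter model has lower prediction risk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed instantiation: Gaussian location family with known unit variance.
# The true mean is of order sqrt(log n / n); the constant 0.5 is an
# illustrative choice, not taken from the original example.
def run(n, reps=20_000):
    theta = 0.5 * np.sqrt(np.log(n) / n)
    # The ML estimate (the sample mean) is distributed N(theta, 1/n),
    # so we draw it directly rather than simulating n observations.
    theta_hat = rng.normal(theta, 1 / np.sqrt(n), reps)
    # Log-likelihood gain of the free model over the theta = 0 model is
    # n * theta_hat^2 / 2; BIC charges log(n) for the extra parameter,
    # so it keeps theta = 0 whenever n * theta_hat^2 < log(n).
    bic_keeps_null = np.mean(n * theta_hat**2 < np.log(n))
    risk_null = theta**2                          # squared error of predicting with 0
    risk_free = np.mean((theta_hat - theta)**2)   # squared error of the ML estimate
    print(f"n={n:9d}: BIC keeps theta=0 in {bic_keeps_null:.0%} of runs; "
          f"risk(theta=0)={risk_null:.2e}, risk(ML)={risk_free:.2e}")

for n in (100, 10_000, 1_000_000):
    run(n)
```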

There is also the non-parametric case, but I don't have much information on that front.

My personal opinion is that all information criteria are approximations, and one should not expect a correct result in every case. I also believe that the model that predicts best is also the model that explains best. When people use the term "model" they usually mean only the set (or number) of parameters, not their fitted values; but if you think of the fitted model as a point hypothesis, then the information content of the disputed extra parameters is virtually zero. That's why I would always choose AIC over BIC, if I were left with only those two options.
