Solved – Does the Akaike information criterion penalize model complexity any more than is necessary to avoid overfitting

model selection

The AIC penalizes complex models. Clearly some penalty for complexity is necessary to avoid overfitting: otherwise we would favour a model that is simply a copy of the data itself, which would tell us nothing.
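For concreteness, the penalty is explicit in the standard definition of the criterion,

$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood: an extra parameter only lowers the AIC if it raises the maximized log-likelihood by at least one unit.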

However, does the AIC penalize complexity any more than is strictly necessary to avoid this?

Edit:

As an example of what I mean, the Wikipedia page for Occam's Razor cites a textbook showing that a presumption in favour of simpler models is not required in Bayesian statistics:

There have also been other attempts to derive Occam's Razor from probability theory, including notable attempts made by Harold Jeffreys and E. T. Jaynes. The probabilistic (Bayesian) basis for Occam's Razor is elaborated by David J. C. MacKay in chapter 28 of his book Information Theory, Inference, and Learning Algorithms,[32] where he emphasises that a prior bias in favour of simpler models is not required.

Is the AIC merely reflecting the frequentist equivalent of the same effect, or does it favour simplicity over and above that?

Best Answer

As Marc stated, this question depends on how you define "more than strictly necessary". It may also depend on what you mean by overfitting. If you mean selecting a model that contains parameters that are not in the "true" model, then the penalty is correct asymptotically. There is a corrected version of AIC (AICc) for smaller samples, but if your sample size is small you may want to run simulations tailored to your modelling goal to determine the best approach (a rough sketch of such a simulation follows below). This paper provides a pretty readable discussion of AIC and BIC for model selection: http://www.sortie-nd.org/lme/Statistical%20Papers/Burnham_and_Anderson_2004_Multimodel_Inference.pdf
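As one way to run such a simulation (my own sketch, not taken from the paper above; the sample size, model sizes, and other settings are arbitrary assumptions), you can compare how often AIC and AICc recover the true number of predictors when n is small:

```python
# Minimal simulation sketch: how often do AIC and the small-sample corrected AICc
# select the true number of predictors? All settings below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_aic(y, X):
    """AIC and AICc for an OLS fit with Gaussian errors (sigma^2 estimated by ML)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = p + 1                                     # coefficients plus the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = 2 * k - 2 * loglik
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)    # small-sample correction
    return aic, aicc

def run(n=20, n_true=2, n_spurious=4, n_sims=2000):
    picks_aic, picks_aicc = [], []
    for _ in range(n_sims):
        X = rng.standard_normal((n, n_true + n_spurious))
        y = X[:, :n_true] @ np.ones(n_true) + rng.standard_normal(n)
        # Fit nested models: intercept plus the first j predictors, j = 0..all
        scores = []
        for j in range(n_true + n_spurious + 1):
            Xj = np.column_stack([np.ones(n), X[:, :j]])
            scores.append(gaussian_aic(y, Xj))
        picks_aic.append(int(np.argmin([s[0] for s in scores])))
        picks_aicc.append(int(np.argmin([s[1] for s in scores])))
    print("P(select true size) AIC :", np.mean(np.array(picks_aic) == n_true))
    print("P(select true size) AICc:", np.mean(np.array(picks_aicc) == n_true))

run()
```

In the same spirit, you can swap in whatever loss actually matters for your application (prediction error, coverage of intervals, etc.) in place of "selecting the true size".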

If you mean overfitting in the sense that your out-of-sample predictions are worse than optimal, then AIC may or may not penalize appropriately. From Applied Econometric Time Series by Enders: "forecasts using overly parsimonious models with little parameter uncertainty can provide better forecasts than models consistent with the actual data-generating process."
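To illustrate that point, here is a rough simulation sketch (my own assumed setup, not taken from Enders): the true data-generating process has one strong and one weak predictor, and with a short training sample, dropping the weak predictor can give lower out-of-sample error than fitting the true specification.

```python
# Illustrative sketch: a too-simple model can forecast better out of sample than the
# model matching the true DGP when the sample is short. Settings are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def simulate(n_train=25, n_test=200, weak=0.15, n_sims=2000):
    mse_simple, mse_true = [], []
    for _ in range(n_sims):
        X = rng.standard_normal((n_train + n_test, 2))
        # True DGP: strong effect of x1, weak effect of x2, unit-variance noise
        y = 1.0 * X[:, 0] + weak * X[:, 1] + rng.standard_normal(n_train + n_test)
        Xtr, Xte, ytr, yte = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

        # "True" specification: both predictors
        b_true, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
        # Overly parsimonious specification: drop the weak predictor
        b_simple, *_ = np.linalg.lstsq(Xtr[:, :1], ytr, rcond=None)

        mse_true.append(np.mean((yte - Xte @ b_true) ** 2))
        mse_simple.append(np.mean((yte - Xte[:, :1] @ b_simple) ** 2))
    print("out-of-sample MSE, true specification :", np.mean(mse_true))
    print("out-of-sample MSE, parsimonious model :", np.mean(mse_simple))

simulate()
```

The trade-off is bias (from omitting the weak predictor) against the extra estimation variance of fitting it; which model wins depends on the sample size and how weak the omitted effect is, which is exactly why a blanket penalty cannot be "correct" for every forecasting problem.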