Solved – Conflicting approaches to variable selection: AIC, p-values or both

aic, feature-selection, hypothesis-testing, model-selection, multiple-regression

From what I understand, variable selection based on p-values (at least in a regression context) is highly flawed. It appears variable selection based on AIC (or similar) is also considered flawed by some, for similar reasons, although this seems a bit unclear (e.g. see my question and some links on this topic here: What exactly is "stepwise model selection"?).

But say you do go for one of these two methods to choose the best set of predictors in your model.

Burnham and Anderson 2002 (Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, page 83) state that one should not mix variable selection based on AIC with that based on hypothesis testing: "Tests of null hypotheses and information-theoretic approaches should not be used together; they are very different analysis paradigms."

On the other hand, Zuur et al. 2009 (Mixed Effects Models and Extensions in Ecology with R, page 541) seem to advocate the use of AIC to first find the optimal model, and then perform "fine tuning" using hypothesis testing: "The disadvantage is that the AIC can be conservative, and you may need to apply some fine tuning (using hypothesis testing procedures from approach one) once the AIC has selected an optimal model."

You can see how this leaves the reader of both books confused over which approach to follow.

1) Are these just different "camps" of statistical thinking and a topic of disagreement among statisticians? Is one of these approaches simply "outdated" now, but was considered appropriate at the time of writing? Or is one just plain wrong from the start?

2) Would there be a scenario in which either of these approaches would be appropriate? For example, I come from a biological background, where I am often trying to determine which, if any, variables seem to affect or drive my response. I often have a number of candidate explanatory variables and I am trying to find which are "important" (in relative terms). Also, note that the set of candidate predictor variables is already reduced to those considered to have some biological relevance, but this may still include 5-20 candidate predictors.

Best Answer

A short answer.

The approach of doing data-driven model selection or tuning, then using standard inferential methods on the selected/tuned model (à la Zuur et al., and many other respected ecologists such as Crawley), will always give overoptimistic results: overly narrow confidence intervals (poor coverage), overly small p-values (high type I error). This is because standard inferential methods assume the model is specified a priori; they don't take the model tuning process into account.
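To see why, here is a minimal simulation sketch (my own illustration, not from any of the sources above): every candidate predictor is pure noise, yet if you pick the best of 20 by p-value and then test it as though it had been specified a priori, the null is rejected far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, n_sim, alpha = 100, 20, 2000, 0.05
false_positives = 0

for _ in range(n_sim):
    X = rng.standard_normal((n, p))  # 20 candidate predictors: pure noise
    y = rng.standard_normal(n)       # response: unrelated to every predictor

    # Screen each predictor by its marginal p-value (a crude stand-in for
    # stepwise selection), then "test" the winner as if chosen in advance.
    pvals = []
    for j in range(p):
        _, pv = stats.pearsonr(X[:, j], y)
        pvals.append(pv)
    if min(pvals) < alpha:
        false_positives += 1

print(f"nominal type I error rate:     {alpha:.2f}")
print(f"observed rate after selection: {false_positives / n_sim:.2f}")  # roughly 0.64
```

The winner's small p-value reflects the selection, not a real effect.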

This is why researchers like Frank Harrell (Regression Modeling Strategies) strongly disapprove of data-driven selection techniques like stepwise regression, and caution that one must do any reduction of the model complexity ("dimension reduction", e.g. computing a PCA of the predictor variables and selecting the first few PCA axes as predictors) by looking only at the predictor variables.
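As a rough sketch of that kind of predictor-only dimension reduction (assuming a hypothetical data frame `df` with the response in column `"y"` and the candidate predictors listed in `predictor_cols`):

```python
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reduce the predictors without ever looking at the response
X_scaled = StandardScaler().fit_transform(df[predictor_cols])
pcs = PCA(n_components=3).fit_transform(X_scaled)   # number of axes fixed a priori

# Ordinary inference on the reduced model; the p-values keep their usual meaning
# because y played no part in constructing the principal components.
fit = sm.OLS(df["y"], sm.add_constant(pcs)).fit()
print(fit.summary())
```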

If you are interested only in finding the best predictive model (and aren't interested in any kind of reliable estimate of the uncertainty of your prediction, which falls in the realm of inference!), then data-driven model tuning is fine (although stepwise selection is rarely the best available option); machine learning/statistical learning algorithms do a lot of tuning to try to get the best predictive model. The "test" or "out-of-sample" error must be assessed on a separate, held-out sample, or any tuning methods need to be built into a cross-validation procedure.
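As a sketch of what "tuning built into a cross-validation procedure" can look like (scikit-learn with simulated data; the lasso here is just one example of a tuned model, not something prescribed above): an inner cross-validation picks the penalty, and an outer cross-validation scores the whole tune-and-fit procedure on data it never saw.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=15, noise=10.0, random_state=0)

# Inner loop: pick the lasso penalty by 5-fold cross-validation
tuned_model = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(max_iter=10_000)),
    param_grid={"lasso__alpha": np.logspace(-3, 1, 20)},
    cv=5,
)

# Outer loop: honest out-of-sample error for the *entire* tuning procedure
scores = cross_val_score(tuned_model, X, y, cv=5, scoring="neg_mean_squared_error")
print("estimated out-of-sample MSE:", -scores.mean())
```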

There does seem to have been historical evolution in opinions on this topic; many classic statistical textbooks, especially those that focus on regression, present stepwise approaches followed by standard inferential procedures without taking the effects of model selection into account [citation needed ...]

There are many ways to quantify variable importance, and not all fall into the post-variable-selection trap.

  • Burnham and Anderson recommend summing AIC weights; there's quite a bit of disagreement over this approach (a minimal sketch follows this list).
  • You could fit the full model (with appropriately scaled/unitless predictors) and rank the predictors by estimated magnitude [biological effect size] or Z-score ["clarity"/statistical effect size]; a sketch of this also follows the list.
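A minimal all-subsets sketch of the summed-AIC-weights idea from the first bullet (my own illustration, not Burnham and Anderson's code; `df`, `"y"` and `predictor_cols` are placeholders, and the number of subsets doubles with every extra predictor, so this is only practical for modest candidate sets):

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

aics, members = [], []
for k in range(len(predictor_cols) + 1):
    for subset in combinations(predictor_cols, k):
        exog = sm.add_constant(df[list(subset)]) if subset else np.ones((len(df), 1))
        aics.append(sm.OLS(df["y"], exog).fit().aic)
        members.append(set(subset))

delta = np.array(aics) - min(aics)                        # AIC differences
weights = np.exp(-delta / 2) / np.exp(-delta / 2).sum()   # Akaike weights

# Summed weight of every model containing each predictor (between 0 and 1)
importance = pd.Series(
    {v: weights[[v in m for m in members]].sum() for v in predictor_cols}
).sort_values(ascending=False)
print(importance)
```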
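A sketch of the second bullet, fitting the full model on standardized predictors and ranking them (same placeholders as above):

```python
import pandas as pd
import statsmodels.api as sm

X = df[predictor_cols]
X = (X - X.mean()) / X.std()        # unitless predictors -> comparable coefficients
fit = sm.OLS(df["y"], sm.add_constant(X)).fit()

ranking = pd.DataFrame({"coef": fit.params, "t": fit.tvalues}).drop("const")
# Sort by |coefficient| (effect size); sort by |t| instead for "clarity"
print(ranking.reindex(ranking["coef"].abs().sort_values(ascending=False).index))
```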