Solved – What exactly is "stepwise model selection"?

Tags: aic, model selection, multiple regression, stepwise regression

Although the merits of stepwise model selection have been discussed previously, it has become unclear to me what exactly "stepwise model selection" or "stepwise regression" is. I thought I understood it, but I'm not so sure anymore.

My understanding is that these two terms are synonymous (at least in a regression context), and that they refer to the selection of the best set of predictor variables in an "optimal" or "best" model, given the data. (You can find the Wikipedia page here, and another potentially useful overview here.)

Based on several previous threads (for example here: Algorithms for automatic model selection), it appears that stepwise model selection is considered a cardinal sin. And yet, it seems to be used all the time, including by what seem to be well-respected statisticians. Or am I mixing up the terminology?

My main questions are:

  1. By "stepwise model selection" or "stepwise regression", do we mean:
    A) doing sequential hypothesis testing, such as likelihood ratio tests or looking at p-values? (There is a related post here: Why are p-values misleading after performing a stepwise selection?) Is this what is meant, and is this why it is considered bad?
    Or
    B) do we also consider selection based on AIC (or a similar information criterion) to be equally bad? From the answer at Algorithms for automatic model selection, it appears that this too is criticized. On the other hand, Whittingham et al. (2006; pdf)1 seem to suggest that variable selection based on an information-theoretic (IT) approach is different from stepwise selection (and seems to be a valid approach)…?

    And this is the source of all my confusion.

    To follow up: if AIC-based selection does fall under "stepwise" and is considered inappropriate, then here are additional questions:

  2. If this approach is wrong, why is it taught in textbooks, university courses, etc.? Is all that plain wrong?

  3. What are good alternatives for selecting which variables should remain in the model? I have come across recommendations to use cross-validation and training-test datasets, and LASSO.

  4. I think everyone can agree that indiscriminately throwing all possible variables into a model and then doing stepwise selection is problematic. Of course, some sane judgement should guide what goes in initially. But what if we already start with a limited number of candidate predictor variables based on some (say, biological) knowledge, and all of these predictors may plausibly explain our response? Would this approach to model selection still be flawed?
    I also acknowledge that selection of the "best" model might not be appropriate if AIC values among different models are very similar (and multi-model inference may be applied in such cases). But is the underlying issue of using AIC-based stepwise selection still problematic?

    If we are looking to see which variables seem to explain the response and in what way, why is this approach wrong, since we know "all models are wrong, but some are useful"?

1. Whittingham, M.J., Stephens, P.A., Bradbury, R.B., & Freckleton, R.P. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75, pp. 1182–1189.

Best Answer

1) The reason you're confused is that the term "stepwise" is used inconsistently. Sometimes it means pretty specific procedures in which $p$-values of regression coefficients, calculated in the ordinary way, are used to determine what covariates are added to or removed from a model, and this process is repeated several times. It may refer to (a) a particular variation of this procedure in which variables can be added or removed at any step (I think this is what SPSS calls "stepwise"), or it may refer to (b) this variation along with other variations such as only adding variables or only removing variables. More broadly, "stepwise" can be used to refer to (c) any procedure in which features are added to or removed from a model according to some value that's computed each time a feature (or set of features) is added or removed.
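In sense (c), the procedure itself is simple to state. Below is a minimal, forward-only sketch (no package's actual implementation), where `score` stands in for whatever criterion is being used — AIC, BIC, a p-value rule, cross-validated error — and `toy_score` is entirely made up for illustration:

```python
def forward_stepwise(candidates, score):
    """Forward stepwise selection in sense (c): at each step, add the
    candidate whose inclusion gives the best (lowest) score; stop as
    soon as no addition improves on the current model."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        trial_score, trial_feat = min(
            (score(selected + [f]), f) for f in remaining
        )
        if trial_score >= best:
            break
        selected.append(trial_feat)
        remaining.remove(trial_feat)
        best = trial_score
    return selected, best

# Hypothetical score for illustration only: pretend "a" and "b" are
# genuinely useful, and every included term costs 1 (a cartoon of an
# AIC-style complexity penalty). Lower is better.
def toy_score(features):
    return len(features) - 2 * len(set(features) & {"a", "b"})

chosen, final_score = forward_stepwise(["a", "b", "c"], toy_score)
```

The point of keeping `score` abstract is that essentially the same loop underlies both the p-value-driven variants and the criterion-driven variants; only the scoring rule (and whether removal steps are also allowed) changes.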

These different strategies have all been criticized for various reasons. I would say that most of the criticism is about (b); the key part of that criticism is that $p$-values are poorly equipped for feature selection (the significance tests here are really testing something quite different from "should I include this variable in the model?"), and most serious statisticians recommend against it in all circumstances. (c) is more controversial.

2) Because statistics education is really bad. To give just one example: so far as I can tell from my own education, it's apparently considered a key part of statistics education for psychology majors to tell students to use Bessel's correction to get unbiased estimates of the population SD. It's true that Bessel's correction makes the estimate of the variance unbiased, but it's easy to prove that the estimate of the SD is still biased. Worse still, Bessel's correction can increase the MSE of these estimates.
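The SD claim is easy to check by simulation with nothing but the standard library (`statistics.variance` and `statistics.stdev` both apply Bessel's correction):

```python
import random
import statistics

# Simulate small samples from N(0, sigma^2): the Bessel-corrected
# variance is unbiased, but its square root (the SD) is not.
random.seed(0)
n, reps, sigma = 5, 100_000, 1.0
variances, sds = [], []
for _ in range(reps):
    x = [random.gauss(0.0, sigma) for _ in range(n)]
    variances.append(statistics.variance(x))  # ddof = 1
    sds.append(statistics.stdev(x))           # sqrt of the above

mean_var = sum(variances) / reps  # close to sigma**2 = 1
mean_sd = sum(sds) / reps         # about 0.94 * sigma for n = 5
```

The averaged variance sits near $\sigma^2$, while the averaged SD sits well below $\sigma$: unbiasedness of $s^2$ does not carry over to $s$, because the square root is a concave transformation (Jensen's inequality).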

3) Variable selection is practically a field unto itself. Cross-validation and train–test splits are ways to evaluate a model, possibly after feature selection; they don't themselves provide suggestions for which features to use. The lasso is often a good choice. So is best subsets.
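For a small number of candidate predictors, best subsets is just an exhaustive search, which is easy to sketch with the standard library. Again `score` stands in for whatever criterion you trust (AIC, cross-validated error), and `toy_score` is a made-up stand-in for illustration:

```python
from itertools import combinations

def best_subsets(candidates, score):
    """Exhaustive best-subsets search: evaluate score(S) (lower is
    better) for every subset S of the candidates, keep the winner."""
    best_sub, best_score = (), score([])
    for r in range(1, len(candidates) + 1):
        for sub in combinations(candidates, r):
            s = score(list(sub))
            if s < best_score:
                best_sub, best_score = sub, s
    return list(best_sub), best_score

# Hypothetical score: "a" and "b" are useful, each term costs 1.
def toy_score(features):
    return len(features) - 2 * len(set(features) & {"a", "b"})

subset, subset_score = best_subsets(["a", "b", "c"], toy_score)
```

Unlike a stepwise walk, this considers every subset, so it cannot get stuck on a greedy path — at the price of $2^p$ fits, which is why it only works for modest $p$.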

4) In my mind, there's still no sense in using (b), especially when you could do something else in (c) instead, like using AIC. I have no objections to AIC-based stepwise selection, but be aware that it's going to be sensitive to the sample (in particular, as samples grow arbitrarily large, AIC, like the lasso, always chooses the most complex model), so don't present the model selection itself as if it were a generalizable conclusion.
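For reference, the quantity being minimized in AIC-based selection of least-squares models — assuming Gaussian errors and dropping additive constants shared by all models fit to the same data — is $n \ln(\mathrm{RSS}/n) + 2k$:

```python
import math

def aic_gaussian(rss: float, n: int, k: int) -> float:
    """AIC for an OLS fit with Gaussian errors, up to an additive
    constant: n * ln(RSS/n) + 2k. Here k counts the estimated
    coefficients; conventions differ on whether sigma^2 is also
    counted, which doesn't affect comparisons if you're consistent."""
    return n * math.log(rss / n) + 2 * k

# A model that barely reduces RSS does not pay for its extra parameter,
# while a substantial reduction does:
base = aic_gaussian(rss=100.0, n=50, k=3)
barely = aic_gaussian(rss=99.0, n=50, k=4)
clearly = aic_gaussian(rss=80.0, n=50, k=4)
```

This makes the trade-off explicit: an extra parameter must buy enough reduction in RSS to overcome the $2k$ penalty, and with large $n$ even tiny reductions do, which is the sensitivity mentioned above.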

"If we are looking to see which variables seem to explain the response and in what way"

Ultimately, if you want to look at the effects of all the variables, you need to include all the variables, and if your sample is too small for that, you need a bigger sample. Remember, null hypotheses are never true in real life. There aren't going to be a bunch of variables that are associated with an outcome and a bunch of other variables that aren't. Every variable will be associated with the outcome—the questions are to what degree, in what direction, in what interactions with other variables, etc.