Solved – What exactly is "stepwise model selection"?

Tags: aic, model selection, multiple regression, stepwise regression

Although the merits of stepwise model selection have been discussed previously, it has become unclear to me what exactly "stepwise model selection" or "stepwise regression" is. I thought I understood it, but I'm not so sure anymore.

My understanding is that these two terms are synonymous (at least in a regression context), and that they refer to the selection of the best set of predictor variables in an "optimal" or "best" model, given the data. (You can find the Wikipedia page here, and another potentially useful overview here.)

Based on several previous threads (for example here: Algorithms for automatic model selection), it appears that stepwise model selection is considered a cardinal sin. And yet, it seems to be used all the time, including by what seem to be well-respected statisticians. Or am I mixing up the terminology?

My main questions are:

  1. By "stepwise model selection" or "stepwise regression", do we mean:
    A) doing sequential hypothesis testing, such as likelihood ratio tests or looking at p-values? (There is a related post here: Why are p-values misleading after performing a stepwise selection?) Is this what is meant, and is this why it is considered bad?
    Or
    B) do we also consider selection based on AIC (or a similar information criterion) to be equally bad? From the answer at Algorithms for automatic model selection, it appears that this too is criticized. On the other hand, Whittingham et al. (2006; pdf)1 seem to suggest that variable selection based on an information-theoretic (IT) approach is different from stepwise selection (and seems to be a valid approach)…?

    And this is the source of all my confusion.

    To follow up: if AIC-based selection does fall under "stepwise" and is considered inappropriate, then here are additional questions:

  2. If this approach is wrong, why is it taught in textbooks, university courses, etc.? Is all that plain wrong?

  3. What are good alternatives for selecting which variables should remain in the model? I have come across recommendations to use cross-validation and training-test datasets, and LASSO.

  4. I think everyone can agree that indiscriminately throwing all possible variables into a model and then doing stepwise selection is problematic. Of course, some sane judgement should guide what goes in initially. But what if we already start with a limited number of candidate predictor variables based on some (say, biological) knowledge, and all of these predictors may plausibly explain our response? Would this approach to model selection still be flawed?
    I also acknowledge that selection of the "best" model might not be appropriate if AIC values among different models are very similar (and multi-model inference may be applied in such cases). But is the underlying issue of using AIC-based stepwise selection still problematic?

    If we are looking to see which variables seem to explain the response and in what way, why is this approach wrong, since we know "all models are wrong, but some are useful"?

1. Whittingham, M.J., Stephens, P.A., Bradbury, R.B., & Freckleton, R.P. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75, pp. 1182–1189.

Best Answer

1) The reason you're confused is that the term "stepwise" is used inconsistently. Sometimes it means pretty specific procedures in which $p$-values of regression coefficients, calculated in the ordinary way, are used to determine what covariates are added to or removed from a model, and this process is repeated several times. It may refer to (a) a particular variation of this procedure in which variables can be added or removed at any step (I think this is what SPSS calls "stepwise"), or it may refer to (b) this variation along with other variations such as only adding variables or only removing variables. More broadly, "stepwise" can be used to refer to (c) any procedure in which features are added to or removed from a model according to some value that's computed each time a feature (or set of features) is added or removed.
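In sense (c), the procedure itself is simple to state. Below is a minimal, forward-only sketch (no package's actual implementation), where `score` stands in for whatever criterion is being used — AIC, BIC, a p-value rule, cross-validated error — and `toy_score` is entirely made up for illustration:

```python
def forward_stepwise(candidates, score):
    """Forward stepwise selection in sense (c): at each step, add the
    candidate whose inclusion gives the best (lowest) score; stop as
    soon as no addition improves on the current model."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        trial_score, trial_feat = min(
            (score(selected + [f]), f) for f in remaining
        )
        if trial_score >= best:
            break
        selected.append(trial_feat)
        remaining.remove(trial_feat)
        best = trial_score
    return selected, best

# Hypothetical score for illustration only: pretend "a" and "b" are
# genuinely useful, and every included term costs 1 (a cartoon of an
# AIC-style complexity penalty). Lower is better.
def toy_score(features):
    return len(features) - 2 * len(set(features) & {"a", "b"})

chosen, final_score = forward_stepwise(["a", "b", "c"], toy_score)
```

The point of keeping `score` abstract is that essentially the same loop underlies both the p-value-driven variants and the criterion-driven variants; only the scoring rule (and whether removal steps are also allowed) changes.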

These different strategies have all been criticized for various reasons. I would say that most of the criticism is about (b); the key part of that criticism is that $p$-values are poorly equipped for feature selection (the significance tests here are really testing something quite different from "should I include this variable in the model?"), and most serious statisticians recommend against it in all circumstances. (c) is more controversial.

2) Because statistics education is really bad. To give just one example: so far as I can tell from my own education, it's apparently considered a key part of statistics education for psychology majors to tell students to use Bessel's correction to get unbiased estimates of the population SD. It's true that Bessel's correction makes the estimate of the variance unbiased, but it's easy to prove that the estimate of the SD is still biased. Worse still, Bessel's correction can increase the MSE of these estimates.
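The SD claim is easy to check by simulation with nothing but the standard library (`statistics.variance` and `statistics.stdev` both apply Bessel's correction):

```python
import random
import statistics

# Simulate small samples from N(0, sigma^2): the Bessel-corrected
# variance is unbiased, but its square root (the SD) is not.
random.seed(0)
n, reps, sigma = 5, 100_000, 1.0
variances, sds = [], []
for _ in range(reps):
    x = [random.gauss(0.0, sigma) for _ in range(n)]
    variances.append(statistics.variance(x))  # ddof = 1
    sds.append(statistics.stdev(x))           # sqrt of the above

mean_var = sum(variances) / reps  # close to sigma**2 = 1
mean_sd = sum(sds) / reps         # about 0.94 * sigma for n = 5
```

The averaged variance sits near $\sigma^2$, while the averaged SD sits well below $\sigma$: unbiasedness of $s^2$ does not carry over to $s$, because the square root is a concave transformation (Jensen's inequality).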

3) Variable selection is practically a field unto itself. Cross-validation and train–test splits are ways to evaluate a model, possibly after feature selection; they don't themselves provide suggestions for which features to use. The lasso is often a good choice. So is best subsets.
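For a small number of candidate predictors, best subsets is just an exhaustive search, which is easy to sketch with the standard library. Again `score` stands in for whatever criterion you trust (AIC, cross-validated error), and `toy_score` is a made-up stand-in for illustration:

```python
from itertools import combinations

def best_subsets(candidates, score):
    """Exhaustive best-subsets search: evaluate score(S) (lower is
    better) for every subset S of the candidates, keep the winner."""
    best_sub, best_score = (), score([])
    for r in range(1, len(candidates) + 1):
        for sub in combinations(candidates, r):
            s = score(list(sub))
            if s < best_score:
                best_sub, best_score = sub, s
    return list(best_sub), best_score

# Hypothetical score: "a" and "b" are useful, each term costs 1.
def toy_score(features):
    return len(features) - 2 * len(set(features) & {"a", "b"})

subset, subset_score = best_subsets(["a", "b", "c"], toy_score)
```

Unlike a stepwise walk, this considers every subset, so it cannot get stuck on a greedy path — at the price of $2^p$ fits, which is why it only works for modest $p$.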

4) In my mind, there's still no sense in using (b), especially when you could do something else in (c) instead, like using AIC. I have no objections to AIC-based stepwise selection, but be aware that it's going to be sensitive to the sample (in particular, as samples grow arbitrarily large, AIC, like the lasso, always chooses the most complex model), so don't present the model selection itself as if it were a generalizable conclusion.
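For reference, the quantity being minimized in AIC-based selection of least-squares models — assuming Gaussian errors and dropping additive constants shared by all models fit to the same data — is $n \ln(\mathrm{RSS}/n) + 2k$:

```python
import math

def aic_gaussian(rss: float, n: int, k: int) -> float:
    """AIC for an OLS fit with Gaussian errors, up to an additive
    constant: n * ln(RSS/n) + 2k. Here k counts the estimated
    coefficients; conventions differ on whether sigma^2 is also
    counted, which doesn't affect comparisons if you're consistent."""
    return n * math.log(rss / n) + 2 * k

# A model that barely reduces RSS does not pay for its extra parameter,
# while a substantial reduction does:
base = aic_gaussian(rss=100.0, n=50, k=3)
barely = aic_gaussian(rss=99.0, n=50, k=4)
clearly = aic_gaussian(rss=80.0, n=50, k=4)
```

This makes the trade-off explicit: an extra parameter must buy enough reduction in RSS to overcome the $2k$ penalty, and with large $n$ even tiny reductions do, which is the sensitivity mentioned above.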

"If we are looking to see which variables seem to explain the response and in what way"

Ultimately, if you want to look at the effects of all the variables, you need to include all the variables, and if your sample is too small for that, you need a bigger sample. Remember, null hypotheses are never true in real life. There aren't going to be a bunch of variables that are associated with an outcome and a bunch of other variables that aren't. Every variable will be associated with the outcome—the questions are to what degree, in what direction, in what interactions with other variables, etc.