Solved – Stepwise regression in R – How does it work

rregression

I am trying to understand the basic difference between stepwise and backward regression in R using the step function.
For stepwise regression I used the following command

  step(lm(mpg~wt+drat+disp+qsec,data=mtcars),direction="both")

I got the below output for the above code.

forward

For backward variable selection I used the following command

 step(lm(mpg~wt+drat+disp+qsec,data=mtcars),direction="backward")

And I got the below output for backward

backward

As much as I have understood, when no parameter is specified, stepwise selection acts as backward unless the parameter "upper" and "lower" are specified in R. Yet in the output of stepwise selection, there is a +disp that is added in the 2nd step. What is the function trying to achieve by adding the +disp again in the stepwise selection? Why is R adding the +disp in the 2nd step whereas the results are the same (AIC values and model selection values) as the backward selection. How is R exactly working in the stepwise selection?

I really want to understand how this function is working in R.
Thanks in advance for the help!

Best Answer

Perhaps it would be easier to understand how stepwise regression is being done by looking at all 15 possible lm models.

Here's a quickie to generate formula for all 15 combinations.

library(leaps)
tmp<-regsubsets(mpg ~ wt + drat + disp + qsec, data=mtcars, nbest=1000, really.big=T, intercept=F)
all.mods <- summary(tmp)[[1]]
all.mods <- lapply(1:nrow(all.mods), function(x) as.formula(paste("mpg~", paste(names(which(all.mods[x,])), collapse="+"))))

head(all.mods)
[[1]]
mpg ~ drat
<environment: 0x0000000013a678d8>

[[2]]
mpg ~ qsec
<environment: 0x0000000013a6b3b0>

[[3]]
mpg ~ wt
<environment: 0x0000000013a6df28>

[[4]]
mpg ~ disp
<environment: 0x0000000013a70aa0>

[[5]]
mpg ~ wt + qsec
<environment: 0x0000000013a74540>

[[6]]
mpg ~ drat + disp
<environment: 0x0000000013a76f68>

AIC values for each of the model are extracted with:

all.lm<-lapply(all.mods, lm, mtcars)

sapply(all.lm, extractAIC)[2,]
 [1]  97.98786 111.77605  73.21736  77.39732  63.90843  77.92493  74.15591  79.02978  91.24052  71.35572
[11]  63.89108  65.90826  78.68074  72.97352  65.62733

Let's go back to your step-regression. The extractAIC value for lm(mpg ~ wt + drat + disp + qsec) is 65.63 (equivalent to model 15 in the list above).

If the model remove disp (-disp), then lm(mpg ~ wt + drat + qsec) is 63.891 (or model 11 in the list).

If the model do not remove anything (none), then the AIC is still 65.63

If the model remove qsec (-qsec), then lm(mpg ~ wt + drat + disp) is 65.908 (model 12).

etc.

Basically the summary reveal the all possible stepwise removal of one-term from your full model and compare the extractAIC value, by listing them in ascending order. Since the smaller AIC value is more likely to resemble the TRUTH model, step retain the (-disp) model in step one.

The process is repeated again, but with the retained (-disp) model as the starting point. Terms are either subtracted ("backwards") or subtracted/added ("both") to allow the comparison of the models. Since the lowest AIC value in comparison is still the (-disp) model, process stop and resultant models given.

With regards to your query: "What is the function trying to achieve by adding the +disp again in the stepwise selection?", in this case, it doesn't really do anything, cos the best model across all 15 models is model 11, i.e. lm(mpg ~ wt + drat + qsec).

However, in complicated models with large number of predictors that require numerous steps to resolve, the adding back of a term that was removed initially is critical to provide the most exhaustive way of comparing the terms.

Hope this help in some way.