I am trying to understand the basic difference between stepwise and backward regression in R using the step function.
For stepwise regression I used the following command
step(lm(mpg~wt+drat+disp+qsec,data=mtcars),direction="both")
I got the below output for the above code.
For backward variable selection I used the following command
step(lm(mpg~wt+drat+disp+qsec,data=mtcars),direction="backward")
And I got the below output for backward
As much as I have understood, when no parameter is specified, stepwise selection acts as backward unless the parameter "upper" and "lower" are specified in R. Yet in the output of stepwise selection, there is a +disp that is added in the 2nd step. What is the function trying to achieve by adding the +disp again in the stepwise selection? Why is R adding the +disp in the 2nd step whereas the results are the same (AIC values and model selection values) as the backward selection. How is R exactly working in the stepwise selection?
I really want to understand how this function is working in R.
Thanks in advance for the help!
Best Answer
Perhaps it would be easier to understand how stepwise regression is being done by looking at all 15 possible lm models.
Here's a quickie to generate formula for all 15 combinations.
AIC values for each of the model are extracted with:
Let's go back to your step-regression. The extractAIC value for lm(mpg ~ wt + drat + disp + qsec) is 65.63 (equivalent to model 15 in the list above).
If the model remove disp (-disp), then lm(mpg ~ wt + drat + qsec) is 63.891 (or model 11 in the list).
If the model do not remove anything (none), then the AIC is still 65.63
If the model remove qsec (-qsec), then lm(mpg ~ wt + drat + disp) is 65.908 (model 12).
etc.
Basically the summary reveal the all possible stepwise removal of one-term from your full model and compare the extractAIC value, by listing them in ascending order. Since the smaller AIC value is more likely to resemble the TRUTH model, step retain the (-disp) model in step one.
The process is repeated again, but with the retained (-disp) model as the starting point. Terms are either subtracted ("backwards") or subtracted/added ("both") to allow the comparison of the models. Since the lowest AIC value in comparison is still the (-disp) model, process stop and resultant models given.
With regards to your query: "What is the function trying to achieve by adding the +disp again in the stepwise selection?", in this case, it doesn't really do anything, cos the best model across all 15 models is model 11, i.e. lm(mpg ~ wt + drat + qsec).
However, in complicated models with large number of predictors that require numerous steps to resolve, the adding back of a term that was removed initially is critical to provide the most exhaustive way of comparing the terms.
Hope this help in some way.