Solved – Stepwise regression in R – How does it work

r, regression

I am trying to understand the basic difference between stepwise and backward regression in R using the step function.
For stepwise regression I used the following command:

  step(lm(mpg~wt+drat+disp+qsec,data=mtcars),direction="both")

I got the below output for the above code.

For backward variable selection I used the following command:

 step(lm(mpg~wt+drat+disp+qsec,data=mtcars),direction="backward")

And I got the below output for backward selection.

As far as I understand, when no parameter is specified, stepwise selection acts as backward selection unless the parameters "upper" and "lower" are specified in R. Yet in the output of the stepwise selection, a +disp is added in the 2nd step. What is the function trying to achieve by adding +disp again in the stepwise selection? Why does R add +disp in the 2nd step when the results (AIC values and selected model) are the same as for backward selection? How exactly does R carry out the stepwise selection?

I really want to understand how this function is working in R.
Thanks in advance for the help!

Best Answer

Perhaps it would be easier to understand how stepwise regression is being done by looking at all 15 possible lm models.

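Here's a quick way to generate the formulas for all 15 combinations.

library(leaps)
# Enumerate every subset of the four predictors (nbest = 1000 keeps them all)
tmp <- regsubsets(mpg ~ wt + drat + disp + qsec, data = mtcars, nbest = 1000, really.big = TRUE, intercept = FALSE)
# Row i of the logical "which" matrix marks the predictors used in model i
all.mods <- summary(tmp)$which
# Turn each row into a formula such as mpg ~ wt + qsec
all.mods <- lapply(1:nrow(all.mods), function(x) as.formula(paste("mpg~", paste(names(which(all.mods[x,])), collapse="+"))))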

head(all.mods)
[[1]]
mpg ~ drat
<environment: 0x0000000013a678d8>

[[2]]
mpg ~ qsec
<environment: 0x0000000013a6b3b0>

[[3]]
mpg ~ wt
<environment: 0x0000000013a6df28>

[[4]]
mpg ~ disp
<environment: 0x0000000013a70aa0>

[[5]]
mpg ~ wt + qsec
<environment: 0x0000000013a74540>

[[6]]
mpg ~ drat + disp
<environment: 0x0000000013a76f68>

The AIC value for each model is extracted with:

all.lm<-lapply(all.mods, lm, mtcars)

sapply(all.lm, extractAIC)[2,]
 [1]  97.98786 111.77605  73.21736  77.39732  63.90843  77.92493  74.15591  79.02978  91.24052  71.35572
[11]  63.89108  65.90826  78.68074  72.97352  65.62733
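
If you don't have leaps installed, the same enumeration can be sketched in base R. This is only an illustrative alternative to the code above, not what step does internally; the names predictors, subsets and aic are my own. It builds all 15 subsets with combn, fits each one with lm, and picks the model with the smallest extractAIC value.

```r
# Base-R sketch: enumerate all non-empty subsets of the four predictors
predictors <- c("wt", "drat", "disp", "qsec")
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate model and keep the AIC reported by extractAIC
aic <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  extractAIC(fit)[2]
})
names(aic) <- sapply(subsets, paste, collapse = " + ")

# Models listed from best (lowest AIC) to worst
print(sort(aic))
best <- names(aic)[which.min(aic)]
print(best)
```

The ordering of the models differs from the regsubsets listing, but the AIC values themselves should agree with the ones shown above.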

Let's go back to your stepwise regression. The extractAIC value for lm(mpg ~ wt + drat + disp + qsec) is 65.63 (model 15 in the list above).

If disp is removed (-disp), lm(mpg ~ wt + drat + qsec) has an AIC of 63.891 (model 11 in the list).

If nothing is removed (<none>), the AIC is still 65.63.

If qsec is removed (-qsec), lm(mpg ~ wt + drat + disp) has an AIC of 65.908 (model 12).

And so on.

Basically, the summary shows every model that can be obtained by removing one term from your full model and compares their extractAIC values, listing them in ascending order. Since the model with the smaller AIC is estimated to be closer to the true model, step retains the (-disp) model in step one.

The process is then repeated with the retained (-disp) model as the starting point. Terms are either removed ("backward") or removed/added ("both") to generate the models to compare. Since the lowest AIC in this comparison still belongs to the (-disp) model, the process stops and the resulting model is returned.
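
The two rounds described above can be reproduced by hand with drop1 and add1 from the stats package. This is a minimal sketch for illustration; the object names (full, reduced, step1, step2_drop, step2_add) are my own.

```r
# Round 1: start from the full model and look at every single-term deletion
full <- lm(mpg ~ wt + drat + disp + qsec, data = mtcars)
step1 <- drop1(full)
print(step1)   # the row with the lowest AIC is the deletion that would be kept

# Round 2: restart from the model without disp
reduced <- update(full, . ~ . - disp)

# Candidates considered by direction = "backward" and "both": further deletions
step2_drop <- drop1(reduced)
print(step2_drop)

# Extra candidate considered only by direction = "both": adding disp back
step2_add <- add1(reduced, scope = formula(full))
print(step2_add)
```

In round 2 the <none> row has the lowest AIC in both tables, which is why the search stops there whichever direction is used.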

With regard to your query, "What is the function trying to achieve by adding the +disp again in the stepwise selection?": in this case it doesn't really do anything, because the best model across all 15 models is model 11, i.e. lm(mpg ~ wt + drat + qsec).

However, in complicated models with a large number of predictors that require numerous steps to resolve, adding back a term that was removed earlier is critical, because it lets the search compare the terms more exhaustively.

Hope this helps in some way.
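
To check that both directions end at the same model for this data, the two searches can be run quietly with trace = 0 and compared. This is a small sketch; the names full, both and back are my own, and the anova component printed at the end is the record of steps that step attaches to its result.

```r
full <- lm(mpg ~ wt + drat + disp + qsec, data = mtcars)

# trace = 0 suppresses the step-by-step printout
both <- step(full, direction = "both", trace = 0)
back <- step(full, direction = "backward", trace = 0)

# Terms retained by each search
print(attr(terms(both), "term.labels"))
print(attr(terms(back), "term.labels"))

# Path taken by the "both" search (one row per accepted step)
print(both$anova)
```

With only four predictors there is little room for the two searches to diverge; the add-back candidates only start to matter once several terms have been dropped.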

Related Solutions

Solved – Selection of regressors

I ran the forward and exhaustive algorithms on the data set that I am working on right now and found the plots to be different.

leaps = regsubsets(orders_rcvd~., data=data[,var_cols], nbest=1, method="forward")
plot(leaps)
leaps = regsubsets(orders_rcvd~., data=data[,var_cols], nbest=1, method="exhaustive")
plot(leaps)

I am guessing that, for your dataset, the model selected by the forward search for each number of variables is the same as the best subset of that size found by the best-subset algorithm.

Questions 2 and 3: The algorithm does work in the way you mentioned. But the plots are not a visual representation of the path the algorithm has taken. Their y-axis is sorted by the adjusted R^2, in your case

Solved – Backward Stepwise Selection

1. Set up an exit criterion for the p-value. Any independent variable with a p-value higher than this criterion will be removed. There isn't any golden rule on what to set; for exploratory purposes you may see variables with p > 0.2 being removed. Others may use 0.05, etc.
2. Set up your full model. Generally, it's the model that contains all the independent variables from which you wish to select the predictive ones.
3. Fit the full model.
4. Check the p-values (or t-statistics). If all p-values are less than the exit criterion, then it's the final model. If any of them exceeds the exit criterion, then the variable with the highest p-value (i.e. the smallest absolute t-statistic) will be removed. This is pertinent to your "dropping the variable with the smallest z-score."
5. Fit the model again with the remaining independent variables, and repeat steps 4 and 5 until either no independent variable is left or no independent variable has a p-value larger than the exit criterion.

Here is an example using Stata:

. sysuse auto
. stepwise, pr(.2): reg mpg weight turn headroom foreign price
                      begin with full model
p = 0.9238 >= 0.2000  removing price
p = 0.7047 >= 0.2000  removing headroom
p = 0.2045 >= 0.2000  removing turn

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   69.75
       Model |   1619.2877     2  809.643849           Prob > F      =  0.0000
    Residual |  824.171761    71   11.608053           R-squared     =  0.6627
-------------+------------------------------           Adj R-squared =  0.6532
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4071

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0065879   .0006371   -10.34   0.000    -.0078583   -.0053175
     foreign |  -1.650029   1.075994    -1.53   0.130      -3.7955    .4954422
       _cons |    41.6797   2.165547    19.25   0.000     37.36172    45.99768
------------------------------------------------------------------------------

The independent variables are weight, turn, headroom, foreign, and price. The dependent variable is mpg. The exit criterion is p >= 0.2. The first variable removed was price (p = 0.9238), followed by headroom (p = 0.7047) and turn (p = 0.2045). The remaining ones have p < 0.2, so they stay.
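
The same procedure can be sketched in R. This is only a rough illustration, not a port of Stata's stepwise command: backward_by_p is a hypothetical helper, it is run on mtcars rather than the auto data, and it assumes all predictors are numeric so that coefficient names match predictor names.

```r
# Hypothetical helper: backward elimination with a p-value exit criterion
backward_by_p <- function(data, response, predictors, p_exit = 0.2) {
  removed <- character(0)
  while (length(predictors) > 0) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    pvals <- setNames(coefs[predictors, "Pr(>|t|)"], predictors)
    worst <- names(which.max(pvals))
    # Stop once every remaining p-value is below the exit criterion
    if (pvals[[worst]] < p_exit) break
    removed <- c(removed, worst)
    predictors <- setdiff(predictors, worst)
  }
  # Refit the final model (intercept only if every predictor was removed)
  rhs <- if (length(predictors) > 0) predictors else "1"
  final <- lm(reformulate(rhs, response = response), data = data)
  list(fit = final, removed = removed)
}

res <- backward_by_p(mtcars, "mpg", c("wt", "drat", "disp", "qsec"))
print(res$removed)        # variables dropped, in order of removal
print(summary(res$fit))   # final model
```

Note that this selects on p-values, whereas R's step function selects on AIC, so the two approaches need not agree on the final model.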
