Solved – using stepAIC of MASS package to select variables with a significance level of 5% in R project

multiple regressionrstepwise regression

First of all, sorry i am new about this and any helps are really welcome.

I am reading a reaserch paper where the authors report: Stepwise forward regression (Zar 1996) was used to select the most informative variables, which were included in a multiple (linear) regression model. A 5% significance level was chosen as a threshold for the inclusion of the model variables.

with a private email the first author told me that the variable selection was performed using stepAIC of MASS library using direction "forward" and they considered only for the final model the variables with a significance level of < 5%.

using junk data i tried to rewrite the analysis in order to understand the procedure

state.x77
st = as.data.frame(state.x77) str(st) colnames(st)[4] = "Life.Exp"
colnames(st)[6] = "HS.Grad" st[,9] = st$Population * 1000 / st$Area colnames(st)[9] = "Density"
str(st) model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + + HS.Grad + Frost + Area + Density, data=st)
model1.stepAIC <- stepAIC(model1, direction=c("both"))
summary(model1.stepAIC)


Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, 
    data = st)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.47095 -0.53464 -0.03701  0.57621  1.50683 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
Population   5.014e-05  2.512e-05   1.996  0.05201 .  
Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared:  0.736,     Adjusted R-squared:  0.7126 
F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

followint the protocol of the paper the final model is

Life.Exp ~ Murder + HS.Grad + Frost (final model) 

because Population is > 0.05.

I wish to know if this final model approach is correct, and then:

fmodel = lm(Life.Exp ~ Murder + HS.Grad + Frost, data=st)
summary(fmodel)

Call:
lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = st)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5015 -0.5391  0.1014  0.5921  1.2268 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 71.036379   0.983262  72.246  < 2e-16 ***
Murder      -0.283065   0.036731  -7.706 8.04e-10 ***
HS.Grad      0.049949   0.015201   3.286  0.00195 ** 
Frost       -0.006912   0.002447  -2.824  0.00699 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7427 on 46 degrees of freedom
Multiple R-squared:  0.7127,    Adjusted R-squared:  0.6939 
F-statistic: 38.03 on 3 and 46 DF,  p-value: 1.634e-12

Best Answer

stepAIC from MASS package or step from stats package functions uses AIC or BIC criteria for selecting variable (Model Selection). You can use forward or backward function from mixlm package, where you can specify the cutoff point of p-value to include and exclude.

Hope this will help you.