First of all, sorry i am new about this and any helps are really welcome.
I am reading a reaserch paper where the authors report: Stepwise forward regression (Zar 1996) was used to select the most informative variables, which were included in a multiple (linear) regression model. A 5% significance level was chosen as a threshold for the inclusion of the model variables.
with a private email the first author told me that the variable selection was performed using stepAIC of MASS library using direction "forward" and they considered only for the final model the variables with a significance level of < 5%.
using junk data i tried to rewrite the analysis in order to understand the procedure
state.x77
st = as.data.frame(state.x77) str(st) colnames(st)[4] = "Life.Exp"
colnames(st)[6] = "HS.Grad" st[,9] = st$Population * 1000 / st$Area colnames(st)[9] = "Density"
str(st) model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + + HS.Grad + Frost + Area + Density, data=st)
model1.stepAIC <- stepAIC(model1, direction=c("both"))
summary(model1.stepAIC)
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
data = st)
Residuals:
Min 1Q Median 3Q Max
-1.47095 -0.53464 -0.03701 0.57621 1.50683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
Population 5.014e-05 2.512e-05 1.996 0.05201 .
Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
followint the protocol of the paper the final model is
Life.Exp ~ Murder + HS.Grad + Frost (final model)
because Population is > 0.05.
I wish to know if this final model approach is correct, and then:
fmodel = lm(Life.Exp ~ Murder + HS.Grad + Frost, data=st)
summary(fmodel)
Call:
lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.5015 -0.5391 0.1014 0.5921 1.2268
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.036379 0.983262 72.246 < 2e-16 ***
Murder -0.283065 0.036731 -7.706 8.04e-10 ***
HS.Grad 0.049949 0.015201 3.286 0.00195 **
Frost -0.006912 0.002447 -2.824 0.00699 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7427 on 46 degrees of freedom
Multiple R-squared: 0.7127, Adjusted R-squared: 0.6939
F-statistic: 38.03 on 3 and 46 DF, p-value: 1.634e-12
Best Answer
stepAIC
fromMASS
package orstep
fromstats
package functions uses AIC or BIC criteria for selecting variable (Model Selection). You can useforward
orbackward
function frommixlm
package, where you can specify the cutoff point of p-value to include and exclude.Hope this will help you.