The workflow of using stepAIC of MASS

multiple regression, stepwise regression

I am confused about how to extract a reduced set of explanatory variables and their coefficients in one step when using stepAIC for multiple regression. It looks as though we need to fit a model first (step 1), then manually select the significant variables (*, ** and ***) and fit the model a second time with the reduced set of variables (step 2). In other words:

Step 1:

library(MASS)  # provides stepAIC()
fit <- lm(fundm ~ datam)
s1 <- stepAIC(fit, direction = "both")
summary(s1)
Call:
lm(formula = fundm ~ datam)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0160190 -0.0033468  0.0003507  0.0031516  0.0185178 

Coefficients:
                                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                          -0.0022594  0.0010541  -2.144  0.03354 *  
datamArbitrage Hedge Fund Index                       0.1244127  0.1498900   0.830  0.40772    
datamAsia Arbitrage Hedge Fund Index                 -0.1124529  0.0635026  -1.771  0.07843 .  
datamAsia Event Driven Hedge Fund Index              -0.1129128  0.0504382  -2.239  0.02652 *  
datamAsia Fixed Income Hedge Fund Index               0.0421230  0.0412566   1.021  0.30875    
datamAsia Long Short Equities Hedge Fund Index       -0.3682610  0.3006139  -1.225  0.22231    
datamAsia Macro Hedge Fund Index                     -0.0061878  0.0112859  -0.548  0.58424    
datamAsia Multi-Strategy Hedge Fund Index             0.0236778  0.0796421   0.297  0.76661    
datamAsia Pacific Absolute Return Fund Index          0.1362870  0.0679360   2.006  0.04648 *  
datamAsia Pacific Fund of Funds Index                 0.0899765  0.1193044   0.754  0.45182    
datamAsian Hedge Fund Index                           0.4133575  0.3915550   1.056  0.29266    
datamCTA/Managed Futures Hedge Fund Index            -0.2906797  0.1853655  -1.568  0.11876    
datamDistressed Debt Fund of Funds Index              0.2324322  0.0775045   2.999  0.00313 ** 
datamDistressed Debt Hedge Fund Index                 0.2484346  0.0710966   3.494  0.00061 ***
datamEmerging Markets Fund of Funds Index             0.1353594  0.0989803   1.368  0.17332    
datamEmerging Markets Hedge Fund Index               -0.0325781  0.1085913  -0.300  0.76455    
datamEmerging Markets Macro Hedge Fund Index          0.0425492  0.0636112   0.669  0.50450    
datamEurope Fund of Funds Index                       0.2360343  0.0863174   2.734  0.00693 **

Step 2:

fit <- lm(fundm ~ col1 + col3 + col5, data = datam)  # keep only variables with p < 0.05 (example)
s1 <- stepAIC(fit, direction = "both")

My questions:

1. Is there a way to select significant variables (let's say p<0.05) and their coefficients in one step?

2. If not, how can I automate step 2, i.e. build a formula with only the significant variables from step 1?

3. Is there another AIC regression package that would offer fully automatic variable selection based on minimum AIC?

Thanks

Best Answer

The best answer to your question is: don't do it. Stepwise selection is almost certain to give poor results that don't generalize well. This highly rated thread goes into exquisite detail about why this is a problem. It's a particular problem if you intend to do rolling multiple regression, as the choices among correlated predictors will tend to vary markedly from one window to the next. Variability of choices among correlated predictors is a problem with all feature-selection methods, including LASSO and elastic net, but the penalization imposed by those two methods improves their predictive performance in ways that unpenalized stepwise selection cannot match. Ridge regression (the limiting case of elastic net that does no feature selection) tends to give fairly stable regression coefficients, because it tends to treat correlated predictors together, and might be better suited to automation if there aren't a very large number of predictors.
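
To make the ridge suggestion concrete, here is a minimal sketch using lm.ridge() from MASS on simulated data. The data, the variable names (x1, x2, x3, y), the penalty grid, and the choice of the GCV-minimizing penalty are all assumptions made for illustration; none of them come from the question's fund data.

```r
library(MASS)  # lm.ridge() ships with MASS

# Simulated data (an assumption for illustration, not the fund data)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is nearly a copy of x1
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + 0.5 * x3 + rnorm(n)
d  <- data.frame(y, x1, x2, x3)

# Unpenalized least squares, for comparison
ols <- lm(y ~ ., data = d)

# Ridge fits over a grid of penalties (the grid is an arbitrary choice)
ridge <- lm.ridge(y ~ ., data = d, lambda = seq(0, 50, by = 0.5))

# One simple way to pick a penalty: minimize generalized cross-validation
best       <- which.min(ridge$GCV)
ridge_coef <- coef(ridge)[best, ]   # intercept first, then x1, x2, x3

print(coef(ols))
print(ridge_coef)
```

All predictors stay in the model, so there is no selection step that can flip between x1 and x2 from one fit to the next. For LASSO or elastic net you would need a different package, since lm.ridge() only does ridge.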

If you nevertheless insist on using stepAIC() despite the overwhelming arguments against stepwise selection (as I used to, before I saw the light), note that its help page says that "the stepwise-selected model is returned," so the selected variables and their coefficients are already in the returned object. Coding help is off topic here, but note that because your formula used the entire data frame as a single right-hand-side object (fundm ~ datam), the function was not working in the environment of the data frame itself but rather in its parent environment. That might have forced the function to report all columns of the data frame, whereas it might do otherwise had you used a standard data=datam argument. You can view the function code by typing stepAIC at the R prompt.
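
As a rough sketch of the difference between the two ways of specifying the model, the following uses simulated data; the names d, X, x1 to x4, and y are hypothetical and not taken from the question. It shows that a matrix on the right-hand side enters as a single term, which is one plausible reason nothing was dropped, and that the object returned by stepAIC() already holds the selected formula and coefficients.

```r
library(MASS)  # provides stepAIC()

# Simulated data (an assumption for illustration): only x1 and x3 affect y
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x3 + rnorm(n)

# A matrix on the right-hand side is ONE model term, so a stepwise
# search can only keep or drop all of its columns together.
X <- as.matrix(d[, c("x1", "x2", "x3", "x4")])
y <- d$y
fit_matrix   <- lm(y ~ X)
matrix_terms <- attr(terms(fit_matrix), "term.labels")
print(matrix_terms)

# With data = ..., every column is its own term and can be dropped separately.
fit_full <- lm(y ~ ., data = d)
s1 <- stepAIC(fit_full, direction = "both", trace = FALSE)

# The returned object is the selected model: no second manual fit is needed.
selected_formula <- formula(s1)               # formula of the selected model
selected_coef    <- coef(s1)                  # its coefficients
coef_table       <- summary(s1)$coefficients  # estimates, SEs, t and p values

print(selected_formula)
print(selected_coef)
```

Note that the p-values in coef_table are computed as if the model had been specified in advance, so they are too optimistic after selection; that is part of the argument against stepwise methods.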
