I am confused about how to extract a reduced set of explanatory variables and their coefficients in one step when using stepAIC for multiple regression. It looks as if we need to fit a model first (step 1), then manually select the significant variables (*, ** and ***) and fit the model with the reduced set of variables a second time (step 2). In other words:
Step 1:
fit<-lm(fundm ~ datam)
s1<-stepAIC(fit,direction="both")
Call:
lm(formula = fundm ~ datam)
Residuals:
Min 1Q Median 3Q Max
-0.0160190 -0.0033468 0.0003507 0.0031516 0.0185178
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0022594 0.0010541 -2.144 0.03354 *
datamArbitrage Hedge Fund Index 0.1244127 0.1498900 0.830 0.40772
datamAsia Arbitrage Hedge Fund Index -0.1124529 0.0635026 -1.771 0.07843 .
datamAsia Event Driven Hedge Fund Index -0.1129128 0.0504382 -2.239 0.02652 *
datamAsia Fixed Income Hedge Fund Index 0.0421230 0.0412566 1.021 0.30875
datamAsia Long Short Equities Hedge Fund Index -0.3682610 0.3006139 -1.225 0.22231
datamAsia Macro Hedge Fund Index -0.0061878 0.0112859 -0.548 0.58424
datamAsia Multi-Strategy Hedge Fund Index 0.0236778 0.0796421 0.297 0.76661
datamAsia Pacific Absolute Return Fund Index 0.1362870 0.0679360 2.006 0.04648 *
datamAsia Pacific Fund of Funds Index 0.0899765 0.1193044 0.754 0.45182
datamAsian Hedge Fund Index 0.4133575 0.3915550 1.056 0.29266
datamCTA/Managed Futures Hedge Fund Index -0.2906797 0.1853655 -1.568 0.11876
datamDistressed Debt Fund of Funds Index 0.2324322 0.0775045 2.999 0.00313 **
datamDistressed Debt Hedge Fund Index 0.2484346 0.0710966 3.494 0.00061 ***
datamEmerging Markets Fund of Funds Index 0.1353594 0.0989803 1.368 0.17332
datamEmerging Markets Hedge Fund Index -0.0325781 0.1085913 -0.300 0.76455
datamEmerging Markets Macro Hedge Fund Index 0.0425492 0.0636112 0.669 0.50450
datamEurope Fund of Funds Index 0.2360343 0.0863174 2.734 0.00693 **
Step 2:
fit<-lm(fundm ~ col1+col3+col5, data=datam) #selecting only vars with p<0.05 - an example shown
s1<-stepAIC(fit,direction="both")
My questions:
1. Is there a way to select significant variables (let's say p<0.05) and their coefficients in one step?
2. If not, how can I automate step 2, i.e. build a formula with only the significant variables from the 1st step?
3. Is there another AIC regression package that would offer fully automatic variable selection based on minimum AIC?
Thanks
Best Answer
The best answer to your question would be to say: don't do it. Stepwise selection is almost certain to give poor results that don't generalize well. This highly rated thread goes into exquisite detail about why this is a problem. It's a particular problem if you intend to do rolling multiple regression, as the choices among correlated predictors will tend to vary markedly as you proceed. Variability of choices among correlated predictors is also a problem with all feature-selection methods, including LASSO and elastic net, but the penalization imposed by those two methods improves their predictive performance in ways that unpenalized stepwise selection cannot match. Ridge regression (a limiting case of elastic net, without feature selection) will tend to give fairly stable regression coefficients and might be better suited to automation if there aren't a very large number of predictors, as it tends to treat correlated predictors together.
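To make the penalized alternatives concrete, here is a minimal sketch using the glmnet package. The package choice, the simulated data, and the names x, y, idx1 to idx5, ridge and lasso are all assumptions for illustration and are not taken from the question. It fits ridge (alpha = 0) and LASSO (alpha = 1) with cross-validated penalty strength and prints the resulting coefficients.

```r
library(glmnet)  # assumption: glmnet is installed; it is not mentioned in the question

# Hypothetical stand-in data: five predictors, of which only idx1 and idx3 matter.
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 5), ncol = 5, dimnames = list(NULL, paste0("idx", 1:5)))
y <- 0.5 * x[, "idx1"] - 0.3 * x[, "idx3"] + rnorm(n, sd = 0.5)

# alpha = 0 gives ridge (shrinks coefficients, no selection);
# alpha = 1 gives LASSO (can set some coefficients exactly to zero).
# cv.glmnet() chooses the penalty strength lambda by cross-validation.
ridge <- cv.glmnet(x, y, alpha = 0)
lasso <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the lambda with the smallest cross-validated error.
coef(ridge, s = "lambda.min")
coef(lasso, s = "lambda.min")
```

Intermediate values of alpha give elastic net. With real, strongly correlated index returns the set of predictors LASSO retains can still change from one window to the next, which is the instability described above.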
If you nevertheless insist on using stepAIC() despite the overwhelming arguments against stepwise selection (as I used to before I saw the light), its help page says that "the stepwise-selected model is returned." Coding help is off topic here, but note that your formula using the entire dataframe meant that the function was not working in the environment of the dataframe itself but rather in its parent environment. That might have forced the function to report all columns of the dataframe, while it might do otherwise had you used a standard data=datam argument. You can view the function code by typing stepAIC at the R prompt.
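As a rough illustration of what "the stepwise-selected model is returned" means in practice, the sketch below runs stepAIC() on a model fitted with a data= argument and reads the reduced formula and coefficients straight off the returned object. The simulated data frame and the column names fund and idx1 to idx5 are hypothetical stand-ins, not the poster's data, and this is not an endorsement of the approach.

```r
library(MASS)  # provides stepAIC()

# Hypothetical stand-in data: a response `fund` and five index columns.
set.seed(1)
n <- 200
datam <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
names(datam) <- paste0("idx", 1:5)
datam$fund <- 0.5 * datam$idx1 - 0.3 * datam$idx3 + rnorm(n, sd = 0.5)

# Fit the full model with data= so that each column is a separate term
# that stepAIC() can drop or re-add individually.
fit <- lm(fund ~ ., data = datam)

# stepAIC() returns the selected model as an ordinary lm object;
# trace = FALSE suppresses the step-by-step log.
s1 <- stepAIC(fit, direction = "both", trace = FALSE)

formula(s1)               # formula of the reduced model
coef(s1)                  # coefficients of the retained terms
summary(s1)$coefficients  # estimates, standard errors, t and p values
```

Note that the selection here is by AIC, not by a p<0.05 cutoff, so the retained terms need not all be individually significant, and the p-values reported after selection are optimistic.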