Solved – The workflow of using stepAIC of MASS

multiple regression, stepwise regression

I am confused about how to extract a reduced set of explanatory variables and their coefficients in one step when using stepAIC for multiple regression. It looks as if we need to fit a model first (step 1), then manually select the significant variables (*, ** and ***) and fit the model with the reduced set of variables a second time (step 2). In other words:

Step 1:

    fit <- lm(fundm ~ datam)
    s1 <- stepAIC(fit, direction = "both")
    Call:
    lm(formula = fundm ~ datam)

    Residuals:
           Min         1Q     Median         3Q        Max 
    -0.0160190 -0.0033468  0.0003507  0.0031516  0.0185178 

    Coefficients:
                                                           Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                                          -0.0022594  0.0010541  -2.144  0.03354 *  
    datamArbitrage Hedge Fund Index                       0.1244127  0.1498900   0.830  0.40772    
    datamAsia Arbitrage Hedge Fund Index                 -0.1124529  0.0635026  -1.771  0.07843 .  
    datamAsia Event Driven Hedge Fund Index              -0.1129128  0.0504382  -2.239  0.02652 *  
    datamAsia Fixed Income Hedge Fund Index               0.0421230  0.0412566   1.021  0.30875    
    datamAsia Long Short Equities Hedge Fund Index       -0.3682610  0.3006139  -1.225  0.22231    
    datamAsia Macro Hedge Fund Index                     -0.0061878  0.0112859  -0.548  0.58424    
    datamAsia Multi-Strategy Hedge Fund Index             0.0236778  0.0796421   0.297  0.76661    
    datamAsia Pacific Absolute Return Fund Index          0.1362870  0.0679360   2.006  0.04648 *  
    datamAsia Pacific Fund of Funds Index                 0.0899765  0.1193044   0.754  0.45182    
    datamAsian Hedge Fund Index                           0.4133575  0.3915550   1.056  0.29266    
    datamCTA/Managed Futures Hedge Fund Index            -0.2906797  0.1853655  -1.568  0.11876    
    datamDistressed Debt Fund of Funds Index              0.2324322  0.0775045   2.999  0.00313 ** 
    datamDistressed Debt Hedge Fund Index                 0.2484346  0.0710966   3.494  0.00061 ***
    datamEmerging Markets Fund of Funds Index             0.1353594  0.0989803   1.368  0.17332    
    datamEmerging Markets Hedge Fund Index               -0.0325781  0.1085913  -0.300  0.76455    
    datamEmerging Markets Macro Hedge Fund Index          0.0425492  0.0636112   0.669  0.50450    
    datamEurope Fund of Funds Index                       0.2360343  0.0863174   2.734  0.00693 **

Step 2:

    fit <- lm(fundm ~ col1 + col3 + col5, data = datam)  # selecting only vars with p < 0.05 (an example)
    s1 <- stepAIC(fit, direction = "both")

My questions:
1. Is there a way to select the significant variables (say, p < 0.05) and their coefficients in one step?
2. If not, how can I automate step 2, i.e. build a formula containing only the significant variables from step 1?
3. Is there another AIC regression package that offers fully automatic variable selection based on minimum AIC?

Thanks

Best Answer

The best answer to your question would be to say: don't do it. Stepwise selection is almost certain to give poor results that don't generalize well. This highly rated thread goes into exquisite detail about why this is a problem. It's a particular problem if you intend to do rolling multiple regression, as the choices among correlated predictors will tend to vary markedly as you proceed. Variability of choices among correlated predictors is also a problem with all feature-selection methods, including LASSO and elastic net, but the penalization imposed by those two methods improves their predictive performance in ways that unpenalized stepwise selection cannot match. Ridge regression (a limiting case of elastic net, without feature selection) will tend to give fairly stable regression coefficients and might be better suited to automation if the number of predictors is not very large, as it tends to treat correlated predictors together.
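To make the ridge suggestion concrete, here is a minimal sketch using lm.ridge() from MASS. The built-in mtcars data and the lambda grid are assumptions for illustration only and are not part of the original question; LASSO or elastic net would need a separate package such as glmnet.

```r
library(MASS)  # provides lm.ridge()

# Assumption: mtcars stands in for the real data; the lambda grid is arbitrary.
lambdas <- seq(0, 10, by = 0.1)
ridge_fit <- lm.ridge(mpg ~ ., data = mtcars, lambda = lambdas)

# Pick the penalty with the smallest generalized cross-validation (GCV) score.
best <- which.min(ridge_fit$GCV)
best_lambda <- lambdas[best]

# coef() returns one row of coefficients per lambda (intercept first).
# Every predictor is kept; ridge shrinks coefficients rather than dropping them.
best_coef <- coef(ridge_fit)[best, ]
best_lambda
best_coef
```

Because no predictor is removed, there is no second "refit with the selected variables" step to automate; only the penalty needs to be chosen.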

If you nevertheless insist on using stepAIC() despite the overwhelming arguments against stepwise selection (as I used to before I saw the light), its help page says that "the stepwise-selected model is returned." Coding help is off topic here, but note that your formula using the entire dataframe meant that the function was not working in the environment of the dataframe itself but rather in its parent environment. That might have forced the function to report all columns of the dataframe, while it might do otherwise had you used a standard data=datam argument. You can view the function code by typing stepAIC at the R prompt.
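As a minimal sketch of that point: when the predictors are separate columns of a data frame passed through data=, the object returned by stepAIC() is already the reduced fitted model, so the retained terms and their coefficients can be read from it directly without a second manual fit. The built-in mtcars data is an assumption standing in for fundm and datam, which are not available here. One plausible reading of the output in the question is that datam entered the formula as a single matrix term, which the search could only keep or drop as a whole.

```r
library(MASS)  # provides stepAIC()

# Assumption: mtcars is only a stand-in for the original fund data.
# Each predictor is its own term, so each can be dropped or re-added.
full_fit <- lm(mpg ~ ., data = mtcars)

# trace = 0 suppresses the step-by-step log.
selected <- stepAIC(full_fit, direction = "both", trace = 0)

formula(selected)                # formula of the selected model
coef(selected)                   # coefficients of the retained terms
summary(selected)$coefficients   # estimates, standard errors, t and p values
selected$anova                   # record of the steps taken
```

Note that the selection here is by AIC, not by a p < 0.05 cutoff, so some retained terms may not carry a significance star.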

Related Solutions

Solved – Assessing the effect of adding a variable using stepwise forward logistic regression using Stata

I am assuming you know that stepwise regression is the wrong approach (see Frank Harrell's terrific book, or just wait for his comments in this thread), and that you are ready to face the criticism of the reviewers (or your dissertation committee, depending on your career stage). I am thus treating this as a programming exercise, rather than a rigorous methodological investigation.

Stata's stepwise command does not support factor variables, as you have probably discovered already, so you'd have to rewrite its main functionality, at least at a descriptive level. I will make use of Ben Jann's estadd command, published in the Stata Journal.

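    * Install the package that provides estadd
    net sj 7-2 st0085_1
    net install st0085_1
    webuse nlswork, clear
    foreach catvar of varlist race grade ind_code occ_code {
      * Fit the base model plus one candidate categorical variable
      regress ln_wage age i.`catvar'
      * Build the list of level terms for this variable (1.race 2.race ...)
      levelsof `catvar', local( thelevels )
      tokenize `thelevels'
      local dotcat
      while "`1'"!="" {
        local dotcat `dotcat' `1'.`catvar'
        macro shift
      }
      * Joint test of all levels; attach its p-value to the stored estimates
      test `dotcat'
      estadd scalar pnew = r(p)
      estimates store with_`catvar'
    }
    * Tabulate the stored models together with the joint-test p-values
    estimates tab with_* , stats( pnew )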

The last line gives you the answers (not terribly informative in this case, of course, as the sample sizes are quite a bit larger than yours).

Feel free to ask about specific commands in this code fragment. Of course, you'd modify this for your own data and estimation command of your liking. The above code assumes Stata 11 and factor variables; you have not stated what version of Stata you are using, which would've helped.

Solved – using stepAIC of MASS package to select variables with a significance level of 5% in R project

The stepAIC function from the MASS package and the step function from the stats package use an AIC or BIC criterion for selecting variables (model selection), not a p-value threshold. You can use the forward or backward functions from the mixlm package, where you can specify the p-value cutoff for including and excluding variables.
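As a small illustration of the AIC versus BIC choice, stepAIC() takes a penalty multiplier k: k = 2 (the default) gives AIC, and k = log(n) gives BIC. The sketch below again assumes the built-in mtcars data as a stand-in; the mixlm functions are not shown because their exact arguments are not given here.

```r
library(MASS)  # provides stepAIC()

# Assumption: mtcars is a stand-in data set for illustration.
full_fit <- lm(mpg ~ ., data = mtcars)
n <- nrow(mtcars)

# k = 2 is the AIC penalty; k = log(n) is the BIC penalty.
aic_fit <- stepAIC(full_fit, direction = "both", k = 2, trace = 0)
bic_fit <- stepAIC(full_fit, direction = "both", k = log(n), trace = 0)

formula(aic_fit)  # model chosen under AIC
formula(bic_fit)  # model chosen under BIC
```

Neither criterion corresponds to a fixed 5% significance level, which is why a p-value cutoff needs a different tool.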

Hope this will help you.
