Model Selection – Modern Alternatives to Stepwise Regression

generalized linear model, model selection, regression, stepwise regression

I have a dataset with around 30 independent variables and would like to construct a generalized linear model (GLM) to explore the relationship between them and the dependent variable.

I am aware that the method I was taught for this situation, stepwise regression, is now considered a statistical sin.

What modern methods of model selection should be used in this situation?

Best Answer

There are several alternatives to stepwise regression. The ones I have seen used most often are:

  • Expert opinion to decide which variables to include in the model.
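  • Partial Least Squares (PLS) regression. It builds a small number of latent variables from the predictors and regresses the response on them. You could also run PCA yourself and then regress on the principal components.
  • Least Absolute Shrinkage and Selection Operator (LASSO), which penalises the absolute size of the coefficients and can set some of them exactly to zero.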

Both PLS regression and LASSO are implemented in R packages, for example:

PLS (pls package): http://cran.r-project.org/web/packages/pls/

LASSO (lars package): http://cran.r-project.org/web/packages/lars/index.html
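To make the selection behaviour of LASSO concrete, here is a minimal sketch in plain Python (chosen only so the example runs without any packages; it is not the algorithm used by the lars package). It minimises the usual penalised least-squares objective by cyclic coordinate descent. The synthetic data, the penalty value `lam=0.3`, and all function names are assumptions made for illustration.

```python
import random

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; anything in [-gamma, gamma] becomes exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are on comparable scales and y is centred,
    so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that ignores j's own contribution
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta

# Synthetic data (an assumption for illustration): 10 predictors, only the first two matter.
random.seed(0)
n, p = 200, 10
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0, 0.5) for row in X]
y_mean = sum(y) / n
y = [v - y_mean for v in y]

beta = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected predictors:", selected)
print("coefficients:", [round(b, 3) for b in beta])
```

The soft-thresholding step is what performs the selection: a predictor whose covariance with the current residual stays below the penalty gets a coefficient of exactly zero. In practice the penalty would be chosen by cross-validation rather than fixed by hand as it is here.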

If you only want to explore the relationship between your dependent variable and the independent variables (e.g. you do not need statistical significance tests), I would also recommend machine learning methods such as Random Forests or Classification/Regression Trees. Random Forests can also approximate complex non-linear relationships between your dependent and independent variables, which linear techniques (such as linear regression) might not reveal.
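The point about non-linear relationships can be illustrated with a small sketch, again in plain Python with no packages. It fits a straight line and a single shallow regression tree to synthetic data where the response depends on the square of the predictor. A Random Forest would average many such trees grown on bootstrap samples; the single tree here is a deliberate simplification, and the data, tree depth, and function names are all assumptions for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(xs, ys, depth, min_leaf=5):
    """Fit a regression tree on one predictor using greedy SSE-minimising splits."""
    if depth == 0 or len(ys) < 2 * min_leaf:
        return {"value": mean(ys)}
    pairs = sorted(zip(xs, ys))
    best_cost, best_k = sse(ys), None
    for k in range(min_leaf, len(pairs) - min_leaf + 1):
        cost = sse([y for _, y in pairs[:k]]) + sse([y for _, y in pairs[k:]])
        if cost < best_cost:
            best_cost, best_k = cost, k
    if best_k is None:
        return {"value": mean(ys)}
    # Split halfway between the two neighbouring x values
    threshold = (pairs[best_k - 1][0] + pairs[best_k][0]) / 2
    left, right = pairs[:best_k], pairs[best_k:]
    return {
        "threshold": threshold,
        "left": fit_tree([x for x, _ in left], [y for _, y in left], depth - 1, min_leaf),
        "right": fit_tree([x for x, _ in right], [y for _, y in right], depth - 1, min_leaf),
    }

def predict(tree, x):
    """Walk down the tree until a leaf is reached."""
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]

def r_squared(ys, preds):
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 1.0 - ss_res / sse(ys)

# Synthetic data (an assumption for illustration): y depends on x only through x squared.
random.seed(1)
xs = [random.uniform(-2, 2) for _ in range(200)]
ys = [x ** 2 + random.gauss(0, 0.2) for x in xs]

# Ordinary least-squares line
mx, my = mean(xs), mean(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
r2_linear = r_squared(ys, [intercept + slope * x for x in xs])

# Depth-2 regression tree (at most four leaves)
tree = fit_tree(xs, ys, depth=2)
r2_tree = r_squared(ys, [predict(tree, x) for x in xs])

print("R^2 linear:", round(r2_linear, 3))
print("R^2 tree:  ", round(r2_tree, 3))
```

Because the relationship is symmetric around zero, the fitted line is nearly flat and explains almost none of the variance, while even a four-leaf tree captures most of it. Both figures are in-sample, so they show fit rather than predictive accuracy.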

A good starting point for machine learning in R might be the Machine Learning task view on CRAN:

Machine Learning Task View: http://cran.r-project.org/web/views/MachineLearning.html
