Solved – Linear model predictor selection: which method to use?

dimensionality reduction, model selection, regularization, stepwise regression

From what I understand, there are three main types of predictor-selection methods for linear models: (1) subset selection, (2) shrinkage, and (3) dimension reduction.

  1. Subset selection includes best subset selection and stepwise
    selection, which can be forward, backward, or hybrid. AIC, BIC,
    Mallows's Cp, or adjusted R-squared can be used to choose among the
    candidate models.

  2. Shrinkage includes ridge regression and the lasso. This approach
    shrinks the coefficient estimates toward 0.

  3. Dimension reduction transforms the predictors and fits the model
    using the transformed predictors.
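The three families above can be sketched with scikit-learn; this is a minimal illustration on synthetic data, and all dataset sizes and tuning values are illustrative choices, not recommendations:

```python
# Sketch of the three predictor-selection families in scikit-learn.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: 10 predictors, only 4 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# 1. Subset selection: greedy forward stepwise search.
stepwise = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=4,
                                     direction="forward").fit(X, y)
print("stepwise keeps features:", np.flatnonzero(stepwise.get_support()))

# 2. Shrinkage: lasso with a cross-validated penalty.
lasso = LassoCV(cv=5).fit(X, y)
print("lasso non-zero coefficients:", np.count_nonzero(lasso.coef_))

# 3. Dimension reduction: regress on the first few principal components.
pcr = make_pipeline(PCA(n_components=4), LinearRegression()).fit(X, y)
print("PCR training R^2:", pcr.score(X, y))
```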

If forecasting accuracy is my main goal and model interpretability is not important, which method(s) should I use?
What should I do if the methods give inconsistent results?
What are the main advantages and disadvantages of each approach?

Best Answer

This is a rather broad question.

First, ridge regression does not shrink coefficients exactly to 0. It does not create sparsity, so if you want to do feature selection it will be useless on its own. You should consider the lasso instead, or the elastic net (a mix of ridge and lasso, since both an L1 and an L2 penalty are added to the minimisation problem).
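A minimal sketch of the sparsity difference, on an illustrative synthetic dataset (the penalty values are arbitrary, not tuned): the lasso sets some coefficients exactly to zero, while ridge only shrinks them toward zero.

```python
# Ridge shrinks toward zero; lasso (and the elastic net, via its L1
# part) produces exact zeros, i.e. feature selection.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mixes L1 and L2

print("exact zeros in ridge coefficients:", np.sum(ridge.coef_ == 0))
print("exact zeros in lasso coefficients:", np.sum(lasso.coef_ == 0))
print("exact zeros in elastic-net coefficients:", np.sum(enet.coef_ == 0))
```

On data like this, ridge typically reports zero exact zeros while the lasso and elastic net zero out most of the uninformative predictors.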

If your goal is really to select variables, have a look at stability selection from Meinshausen and Bühlmann. The idea is to bootstrap (resample) the data and run a lasso regression on each resample. It exploits the fact that the lasso has a homotopy: each coefficient follows a piecewise-continuous solution path in the penalty. Starting from a very high penalty and decreasing it step by step, the coefficients become non-zero one by one. Repeating this over many resamples gives, for each penalty value, the probability that each coefficient is non-zero (i.e., that the variable is selected).
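A rough sketch of that procedure, not a faithful implementation of the Meinshausen–Bühlmann paper: the penalty grid, subsample fraction, number of resamples, and the 0.8 stability threshold are all illustrative choices.

```python
# Stability-selection-style procedure: repeatedly subsample the data,
# fit a lasso path on each subsample, and record how often each
# variable is selected (non-zero) at each penalty value.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=200, n_features=15, n_informative=3,
                       noise=1.0, random_state=0)

rng = np.random.default_rng(0)
alphas = np.logspace(0, -2, 20)        # high penalty -> low penalty
n_resamples, frac = 50, 0.5
freq = np.zeros((X.shape[1], len(alphas)))

for _ in range(n_resamples):
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    _, coefs, _ = lasso_path(X[idx], y[idx], alphas=alphas)
    freq += (coefs != 0)               # non-zero = variable selected

freq /= n_resamples                    # selection frequency per penalty
stable = np.flatnonzero(freq.max(axis=1) >= 0.8)
print("stably selected variables:", stable)
```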

This is a good option when you have many variables, because the lasso can be seen as a convex relaxation of subset selection, so it is usually much faster.

Dimension reduction (PCA, for example) is not necessarily designed to improve predictive accuracy, because it is often unsupervised: the components are chosen without ever looking at the response. See http://metaoptimize.com/qa/questions/9338/how-to-use-pca-for-classification for a more detailed discussion of that subject.