Regression – Using Lasso for Feature Selection Followed by a Non-Regularized Regression

Tags: feature-selection, lasso, logistic-regression

I use Lasso logistic regression to identify a smaller subset of important variables. I start with N = 51 (28/23) and 32 predictors.

So far it looks pretty promising, because I can identify four important predictors in my optimal model.

Now I would like to take those four predictors and examine them along with some control variables in a standard logistic regression.

My question is, does that analysis strategy make sense? Is there a better way to include controls or other variables that might be interesting?

For a better understanding:

  1. Identify important variables via Lasso logistic regression

  2. Do further analysis including identified predictors and other control variables using standard logistic regression (using AIC to check model fit)

Best Answer

Note that there exist multiple iterative LASSO procedures, so in general, it is not necessarily true that you should stick with the first LASSO estimates.

For example:

  • Post-LASSO-OLS: see Belloni, Chernozhukov (2013), Least squares after model selection in high-dimensional sparse models, Bernoulli 19(2), 521–547. Also known as the LASSO-OLS hybrid (Efron et al. 2004, Least angle regression, Annals of Statistics 32, 407–451)

  • Adaptive LASSO (Zou 2006), possibly with multiple stages (Bühlmann, Meier 2008). Two (or more) stages, each using a CV procedure, with the second step using a modified (re-weighted) penalty.

  • Relaxed LASSO (Meinshausen 2007), which refits over a collection of subsets computed by an initial LASSO
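To make the second bullet concrete, here is a hedged sketch of the adaptive LASSO for logistic regression, using the standard feature-rescaling trick (this is one common way to implement the re-weighted penalty; the data, the ridge initial estimator, and the choice of weight exponent γ = 1 are all illustrative assumptions):

```python
# Adaptive LASSO sketch: weight each feature by an initial coefficient
# estimate, run an ordinary L1 fit on the rescaled features, rescale back.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           random_state=0)

# Stage 1: initial (ridge) estimates supply the adaptive weights |b_j|
init = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
w = np.abs(init.coef_.ravel()) + 1e-8   # small offset avoids division by zero
X_scaled = X * w                        # rescaling == re-weighted L1 penalty

# Stage 2: ordinary L1 fit on the rescaled features, penalty chosen by CV
l1 = LogisticRegressionCV(penalty="l1", solver="liblinear",
                          cv=5, random_state=0).fit(X_scaled, y)

# Map coefficients back to the original scale; features with small initial
# estimates are penalized more heavily and tend to be dropped.
beta_adaptive = l1.coef_.ravel() * w
print("nonzero after adaptive LASSO:", np.flatnonzero(beta_adaptive))
```

The point of the re-weighting is that variables with weak initial estimates face a harsher penalty in the second stage, which tends to reduce the false selections of a single plain LASSO pass.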

Now, in general, I would use one of these procedures to decide whether or not to add more variables, rather than a BIC-based model selection procedure.