Using glmnet is really easy once you get the hang of it, thanks to its excellent vignette at http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (you can also check the CRAN package page).
As for the best lambda for glmnet, the rule of thumb is to use

cvfit <- glmnet::cv.glmnet(x, y)   # k-fold cross-validation over the lambda path
coef(cvfit, s = "lambda.1se")      # largest lambda within 1 SE of the minimum CV error

instead of lambda.min.
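For context, lambda.min minimizes the cross-validated error, while lambda.1se is the largest (most regularized) lambda whose error is within one standard error of that minimum. A quick way to see the difference is to count the selected variables under each choice; here is a minimal sketch on simulated data (xsim and ysim are made-up names used only for this illustration):

set.seed(1)
xsim <- matrix(rnorm(100 * 20), 100, 20)                 # simulated predictors
ysim <- drop(xsim[, 1:3] %*% c(2, -1, 1)) + rnorm(100)   # only 3 true signals
cvfit <- glmnet::cv.glmnet(xsim, ysim)
sum(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)  # usually keeps more variables
sum(as.numeric(coef(cvfit, s = "lambda.1se"))[-1] != 0)  # sparser, more conservative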
To do the same for lars you have to do it by hand. Here is my solution:

# cross-validate over the lasso steps (one per knot)
cv <- lars::cv.lars(x, y, plot.it = FALSE, mode = "step")
# first step whose CV error is within one standard error of the minimum
idx <- which.max(cv$cv - cv$cv.error <= min(cv$cv))
coef(lars::lars(x, y))[idx, ]
Bear in mind that this is not exactly the same, because it stops at a lasso knot (where a variable enters the model) instead of at an arbitrary point along the path.
Please note that glmnet is the preferred package now; it is actively maintained, more so than lars, and questions about glmnet vs lars have been answered before (the algorithms they use differ).
As for your question of using the lasso to choose variables and then fitting OLS, it is an ongoing debate. Google for "OLS post Lasso" and you will find some papers discussing the topic. Even the authors of Elements of Statistical Learning admit it is possible.
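If you want to try that approach, here is a minimal sketch (an illustration of the mechanics, not an endorsement of the method; it assumes x is a matrix with column names):

# select variables at lambda.1se, then refit OLS on the selected support
cvfit <- glmnet::cv.glmnet(x, y)
beta  <- coef(cvfit, s = "lambda.1se")
keep  <- rownames(beta)[as.numeric(beta) != 0]
keep  <- setdiff(keep, "(Intercept)")   # drop the intercept from the selection
ols   <- lm(y ~ x[, keep])              # plain OLS on the lasso-selected columns
summary(ols)

Note that the standard errors reported by summary(ols) ignore the selection step, which is exactly what the debate is about.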
Edit: Here is the code to reproduce more accurately what glmnet does in lars:

# cross-validate over the fraction of the final L1 norm (the default mode)
cv <- lars::cv.lars(x, y, plot.it = FALSE)
# smallest fraction whose CV error is within one standard error of the minimum
ideal_l1_ratio <- cv$index[which.max(cv$cv - cv$cv.error <= min(cv$cv))]
obj <- lars::lars(x, y)
# rescale the path coefficients so their L1 norms are comparable to cv.lars's fractions
scaled_coefs <- scale(obj$beta, FALSE, 1 / obj$normx)
l1 <- apply(X = scaled_coefs, MARGIN = 1, FUN = function(x) sum(abs(x)))
# first step whose L1 ratio exceeds the cross-validated ideal ratio
coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio), ]
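If you want to sanity-check the two packages against each other, you can put the coefficient vectors side by side (a rough sketch reusing obj, l1 and ideal_l1_ratio from above; the fits will not match exactly, since the solvers and lambda grids differ, but they should select similar variables):

cvg <- glmnet::cv.glmnet(x, y)
cbind(
  glmnet = as.numeric(coef(cvg, s = "lambda.1se"))[-1],  # drop the intercept
  lars   = coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio), ]
)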
Best Answer
There is a package in R called glmnet that can fit a LASSO logistic model for you! This will be more straightforward than the approach you are considering. More precisely, glmnet fits the elastic net, a hybrid between LASSO and ridge regression, but you may set the parameter $\alpha=1$ to get a pure LASSO model. Since you are interested in logistic regression, you will set family="binomial".
You can read more here: http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#intro
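Here is a minimal sketch, assuming x is your predictor matrix and y is a binary 0/1 (or two-level factor) response:

library(glmnet)
# family = "binomial" gives logistic regression; alpha = 1 is the pure LASSO penalty
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.1se")                                  # selected coefficients
predict(cvfit, newx = x, s = "lambda.1se", type = "response")  # fitted probabilities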