Solved – Using LASSO from the lars (or glmnet) package in R for variable selection

Tags: feature selection, glmnet, lars, lasso

Sorry if this question comes across as a little basic.

I am looking to use LASSO variable selection for a multiple linear regression model in R. I have 15 predictors, one of which is categorical (will that cause a problem?). After setting my $x$ and $y$, I use the following commands:

library(lars)
model <- lars(x, y)  # fits the whole lasso path
coef(model)          # coefficients at each step along the path
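
For reference, here is roughly how I set my $x$ and $y$ (the data frame and column names below are placeholders for my real ones); I used model.matrix to dummy-code the categorical predictor, since lars needs a numeric matrix:

# "df" stands in for my data frame, with the response in column "resp"
x <- model.matrix(resp ~ ., data = df)[, -1]  # expands the factor into dummies, drops the intercept
y <- df$resp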

My problem comes when I use coef(model). It returns a matrix with 15 rows, with one extra predictor added at each step. However, there is no indication of which model to choose. Have I missed something? Is there a way to get the lars package to return just one "best" model?

There are other posts suggesting using glmnet instead, but that seems more complicated. An attempt follows, using the same $x$ and $y$. Have I missed something here?

library(glmnet)
cv <- cv.glmnet(x, y)  # cross-validation to choose lambda
model <- glmnet(x, y, type.gaussian = "covariance", lambda = cv$lambda.min)
predict(model, type = "coefficients")

The final command returns a list of my variables, most with a coefficient and some equal to 0. Is this the correct choice of the "best" model selected by LASSO? If I then fit a linear model with all the variables whose coefficients were not 0, I get very similar, but slightly different, coefficient estimates. Is there a reason for this difference? Would it be acceptable to refit the linear model with the variables chosen by LASSO and take that as my final model? Otherwise I cannot see any p-values for significance. Have I missed anything?

Does

type.gaussian="covariance" 

ensure that glmnet uses multiple linear regression?

Does the automatic normalisation of the variables affect the coefficients at all? Is there any way to include interaction terms in a LASSO procedure?
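
The only way I could think of for interactions is to expand them by hand into the design matrix before calling glmnet, though I do not know if this is the intended approach (df and resp are placeholders again):

x_int <- model.matrix(resp ~ .^2, data = df)[, -1]  # adds all pairwise interactions
fit <- glmnet(x_int, df$resp)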

I am looking to use this procedure more as a demonstration of how LASSO can be used than for a model that will actually be used for any important inference/prediction, if that changes anything.

Thank you for taking the time to read this. Any general comments on LASSO/lars/glmnet would also be greatly appreciated.

Best Answer

Using glmnet is really easy once you get the hang of it, thanks to its excellent vignette at http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (you can also check the CRAN package page). As for the best lambda for glmnet, the rule of thumb is to use

cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")  # most regularised model within one SE of the CV minimum

instead of lambda.min.
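
To make this concrete, here is a minimal self-contained sketch with simulated data (purely illustrative; the variable names are made up):

set.seed(1)
n <- 100; p <- 15
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("V", seq_len(p))
beta <- c(rep(2, 3), rep(0, p - 3))  # only the first three predictors matter
y <- drop(x %*% beta + rnorm(n))

cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")  # sparser model, within one SE of the CV minimum
coef(cvfit, s = "lambda.min")  # model at the CV minimum, for comparison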

To do the same for lars you have to do it by hand. Here is my solution:

cv <- lars::cv.lars(x, y, plot.it = FALSE, mode = "step")
# first step whose CV error is within one standard error of the minimum
idx <- which.max(cv$cv - cv$cv.error <= min(cv$cv))
coef(lars::lars(x, y))[idx, ]

Bear in mind that this is not exactly the same, because it stops at a lasso knot (where a variable enters) rather than at an arbitrary point along the path.

Please note that glmnet is the preferred package now; it is more actively maintained than lars, and questions about glmnet vs. lars have been answered here before (the algorithms they use differ).

As for your question about using the lasso to choose variables and then fitting OLS, it is an ongoing debate. Google "OLS post Lasso" and you will find some papers discussing the topic. Even the authors of The Elements of Statistical Learning admit it is possible.
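
Continuing the simulated sketch above, here is one possible version of that two-step refit (an illustration, not a recommendation):

cf <- coef(cvfit, s = "lambda.1se")  # sparse one-column matrix of coefficients
keep <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
ols <- lm(y ~ ., data = data.frame(y = y, x[, keep, drop = FALSE]))
summary(ols)  # p-values for the selected variables; post-selection caveats apply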

Edit: here is the code to reproduce more accurately in lars what glmnet does:

cv <- lars::cv.lars(x, y, plot.it = FALSE)  # default mode = "fraction"
# smallest L1 fraction whose CV error is within one standard error of the minimum
ideal_l1_ratio <- cv$index[which.max(cv$cv - cv$cv.error <= min(cv$cv))]
obj <- lars::lars(x, y)
# rescale the coefficients (by normx) so their L1 norm is on the scale cv.lars uses
scaled_coefs <- scale(obj$beta, FALSE, 1 / obj$normx)
l1 <- apply(X = scaled_coefs, MARGIN = 1, FUN = function(x) sum(abs(x)))
# first point on the path whose L1 fraction exceeds the CV-chosen ratio
coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio), ]