Variable selection (without penalization) only makes things worse. Variable selection has almost no chance of finding the "right" variables, and it results in large overstatements of the effects of the remaining variables and huge understatement of standard errors. It is a mistake to believe that variable selection done in the usual way helps one get around the "large p, small n" problem. The bottom line is that the final model is misleading in every way. This is related to an astounding statement I read in an epidemiology paper: "We didn't have an adequate sample size to develop a multivariable model, so instead we performed all possible tests for 2x2 tables."
Any time the dataset at hand is used to eliminate variables, while making use of Y to make the decision, all statistical quantities will be distorted. Typical variable selection is a mirage.
Edit: (copying comments from below that were hidden by the fold)
I don't want to be self-serving, but my book Regression Modeling Strategies goes into this in some depth. Online materials, including handouts, may be found at my webpage. Some available methods are $L_2$ penalization (ridge regression), $L_1$ penalization (lasso), and the so-called elastic net (a combination of $L_1$ and $L_2$). Or use data reduction (blinded to the response $Y$) before doing regression. My book spends more space on data reduction than on penalization.
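For concreteness, here is a minimal sketch of fitting those three penalties with the glmnet package; the names x and y follow the code below, and the alpha values are illustrative (x is assumed to be a numeric predictor matrix, y a numeric response):

library(glmnet)
# alpha controls the penalty mix: 0 = ridge (L2), 1 = lasso (L1),
# values in between give the elastic net
fit_ridge <- cv.glmnet(x, y, alpha = 0)
fit_lasso <- cv.glmnet(x, y, alpha = 1)
fit_enet  <- cv.glmnet(x, y, alpha = 0.5)
# coefficients at a cross-validated lambda (see the lambda.1se rule below)
coef(fit_lasso, s = "lambda.1se")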
Using glmnet is really easy once you get the hang of it, thanks to its excellent vignette at http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (you can also check the CRAN package page).
As for the best lambda for glmnet, the rule of thumb is to use

cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")

instead of lambda.min.
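(To get predictions at that value, the same label can be passed to predict; x_new here is a hypothetical matrix of new observations.)

predict(cvfit, newx = x_new, s = "lambda.1se")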
To apply the same lambda.1se-style rule with lars, you have to do it by hand. Here is my solution:
cv <- lars::cv.lars(x, y, plot.it = FALSE, mode = "step")
# first step at which the CV error minus one standard error drops below the
# minimum CV error (the analogue of glmnet's lambda.1se rule)
idx <- which.max(cv$cv - cv$cv.error <= min(cv$cv))
coef(lars::lars(x, y))[idx,]
Bear in mind that this is not exactly the same, because it stops at a lasso knot (where a variable enters) rather than at an arbitrary point.
Please note that glmnet is now the preferred package; it is actively maintained, more so than lars, and questions about glmnet vs. lars have been answered before (the algorithms used differ).
As for your question about using the lasso to choose variables and then fitting OLS, it is an ongoing debate. Google "OLS post-lasso" and you will find papers discussing the topic. Even the authors of The Elements of Statistical Learning admit it is possible.
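If you do want to try the lasso-then-OLS route, a minimal sketch with glmnet looks like the following (it assumes x is a numeric matrix with column names and y a numeric vector; whether the resulting OLS inference is trustworthy is exactly the point under debate):

cvfit <- glmnet::cv.glmnet(x, y)
# variables with nonzero lasso coefficients at lambda.1se (dropping the intercept)
b <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- rownames(b)[-1][b[-1, 1] != 0]
# refit plain OLS on the selected columns only
ols_fit <- lm(y ~ ., data = data.frame(y = y, x[, selected, drop = FALSE]))
summary(ols_fit)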
Edit: Here is the code to reproduce more accurately what glmnet does in lars:
cv <- lars::cv.lars(x, y, plot.it = FALSE)
# fraction of the full L1 norm at which the 1-SE-style rule is first satisfied
ideal_l1_ratio <- cv$index[which.max(cv$cv - cv$cv.error <= min(cv$cv))]
obj <- lars::lars(x, y)
# rescale the coefficients to the normalized-x scale lars uses internally,
# so the L1 norms are comparable across steps
scaled_coefs <- scale(obj$beta, FALSE, 1 / obj$normx)
l1 <- apply(X = scaled_coefs, MARGIN = 1, FUN = function(x) sum(abs(x)))
# pick the first step whose L1-norm fraction exceeds the cross-validated fraction
coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio),]
Best Answer
A major advantage of the double-selection method is that it is heteroskedasticity-robust. Belloni et al. showed that this holds even if the selection is not perfect.
'We propose robust methods for inference about the effect of a treatment variable on a scalar outcome in the presence of very many regressors in a model with possibly non-Gaussian and heteroscedastic disturbances.'
'The main attractive feature of our method is that it allows for imperfect selection of the controls and provides confidence intervals that are valid uniformly across a large class of models. In contrast, standard post-model selection estimators fail to provide uniform inference even in simple cases with a small, fixed number of controls. '
Belloni et al., "Inference on Treatment Effects after Selection among High-Dimensional Controls": https://academic.oup.com/restud/article-abstract/81/2/608/1523757
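As a rough illustration of the double-selection idea (not Belloni et al.'s exact estimator, which uses their own penalty choices), one can lasso the outcome on the controls, lasso the treatment on the controls, take the union of the selected controls, and then run OLS of the outcome on the treatment plus that union with heteroskedasticity-robust standard errors. The names here are illustrative: x is an assumed matrix of candidate controls with column names, d the treatment, and y the outcome.

library(glmnet)

select_nonzero <- function(response, x) {
  # controls with nonzero lasso coefficients at lambda.1se
  cv <- cv.glmnet(x, response)
  b  <- as.matrix(coef(cv, s = "lambda.1se"))
  rownames(b)[-1][b[-1, 1] != 0]
}

# step 1: controls that predict the outcome; step 2: controls that predict the treatment
controls <- union(select_nonzero(y, x), select_nonzero(d, x))

# step 3: OLS of y on d and the union of selected controls,
# with heteroskedasticity-robust (sandwich) standard errors
dat <- data.frame(y = y, d = d, x[, controls, drop = FALSE])
fit <- lm(y ~ ., data = dat)
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC1"))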