Variable selection (without penalization) only makes things worse. Variable selection has almost no chance of finding the "right" variables, and it results in large overstatements of the effects of the remaining variables and huge understatement of standard errors. It is a mistake to believe that variable selection done in the usual way helps one get around the "large p, small n" problem. The bottom line is that the final model is misleading in every way. This is related to an astounding statement I read in an epidemiology paper: "We didn't have an adequate sample size to develop a multivariable model, so instead we performed all possible tests for 2x2 tables."
Any time the dataset at hand is used to eliminate variables, while making use of Y to make the decision, all statistical quantities will be distorted. Typical variable selection is a mirage.
Edit: (copying comments from below that were hidden by the fold)
I don't want to be self-serving, but my book Regression Modeling Strategies goes into this in some depth. Online materials, including handouts, may be found at my webpage. Some available methods are $L_2$ penalization (ridge regression), $L_1$ penalization (lasso), and the so-called elastic net (a combination of $L_1$ and $L_2$). Or use data reduction (blinded to the response $Y$) before doing regression. My book spends more space on data reduction than on penalization.
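For concreteness, here is a minimal sketch of fitting those three penalties with the glmnet package; the names x and y follow the code below, and the alpha values are illustrative (x is assumed to be a numeric predictor matrix, y a numeric response):

library(glmnet)
# alpha controls the penalty mix: 0 = ridge (L2), 1 = lasso (L1),
# values in between give the elastic net
fit_ridge <- cv.glmnet(x, y, alpha = 0)
fit_lasso <- cv.glmnet(x, y, alpha = 1)
fit_enet  <- cv.glmnet(x, y, alpha = 0.5)
# coefficients at a cross-validated lambda (see the lambda.1se rule below)
coef(fit_lasso, s = "lambda.1se")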
Using glmnet is really easy once you get the hang of it, thanks to its excellent vignette at http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (you can also check the CRAN package page).
As for the best lambda for glmnet, the rule of thumb is to use

cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")

instead of lambda.min.
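(To get predictions at that value, the same label can be passed to predict; x_new here is a hypothetical matrix of new observations.)

predict(cvfit, newx = x_new, s = "lambda.1se")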
To apply the same lambda.1se-style rule with lars, you have to do it by hand. Here is my solution:
cv <- lars::cv.lars(x, y, plot.it = FALSE, mode = "step")
# first step at which the CV error minus one standard error drops below the
# minimum CV error (the analogue of glmnet's lambda.1se rule)
idx <- which.max(cv$cv - cv$cv.error <= min(cv$cv))
coef(lars::lars(x, y))[idx,]
Bear in mind that this is not exactly the same, because it stops at a lasso knot (where a variable enters) rather than at an arbitrary point.
Please note that glmnet is now the preferred package; it is actively maintained, more so than lars, and questions about glmnet vs. lars have been answered before (the algorithms used differ).
As for your question about using the lasso to choose variables and then fitting OLS, it is an ongoing debate. Google "OLS post-lasso" and you will find papers discussing the topic. Even the authors of The Elements of Statistical Learning admit it is possible.
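If you do want to try the lasso-then-OLS route, a minimal sketch with glmnet looks like the following (it assumes x is a numeric matrix with column names and y a numeric vector; whether the resulting OLS inference is trustworthy is exactly the point under debate):

cvfit <- glmnet::cv.glmnet(x, y)
# variables with nonzero lasso coefficients at lambda.1se (dropping the intercept)
b <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- rownames(b)[-1][b[-1, 1] != 0]
# refit plain OLS on the selected columns only
ols_fit <- lm(y ~ ., data = data.frame(y = y, x[, selected, drop = FALSE]))
summary(ols_fit)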
Edit: Here is the code to reproduce more accurately what glmnet does in lars:
cv <- lars::cv.lars(x, y, plot.it = FALSE)
# fraction of the full L1 norm at which the 1-SE-style rule is first satisfied
ideal_l1_ratio <- cv$index[which.max(cv$cv - cv$cv.error <= min(cv$cv))]
obj <- lars::lars(x, y)
# rescale the coefficients to the normalized-x scale lars uses internally,
# so the L1 norms are comparable across steps
scaled_coefs <- scale(obj$beta, FALSE, 1 / obj$normx)
l1 <- apply(X = scaled_coefs, MARGIN = 1, FUN = function(x) sum(abs(x)))
# pick the first step whose L1-norm fraction exceeds the cross-validated fraction
coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio),]
Best Answer
A major advantage of the double-selection method is that it is heteroskedasticity-robust. Belloni et al. showed that this holds even if the selection is not perfect.
'We propose robust methods for inference about the effect of a treatment variable on a scalar outcome in the presence of very many regressors in a model with possibly non-Gaussian and heteroscedastic disturbances.'
'The main attractive feature of our method is that it allows for imperfect selection of the controls and provides confidence intervals that are valid uniformly across a large class of models. In contrast, standard post-model selection estimators fail to provide uniform inference even in simple cases with a small, fixed number of controls. '
Belloni et al., "Inference on Treatment Effects after Selection among High-Dimensional Controls": https://academic.oup.com/restud/article-abstract/81/2/608/1523757
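As a rough illustration of the double-selection idea (not Belloni et al.'s exact estimator, which uses their own penalty choices), one can lasso the outcome on the controls, lasso the treatment on the controls, take the union of the selected controls, and then run OLS of the outcome on the treatment plus that union with heteroskedasticity-robust standard errors. The names here are illustrative: x is an assumed matrix of candidate controls with column names, d the treatment, and y the outcome.

library(glmnet)

select_nonzero <- function(response, x) {
  # controls with nonzero lasso coefficients at lambda.1se
  cv <- cv.glmnet(x, response)
  b  <- as.matrix(coef(cv, s = "lambda.1se"))
  rownames(b)[-1][b[-1, 1] != 0]
}

# step 1: controls that predict the outcome; step 2: controls that predict the treatment
controls <- union(select_nonzero(y, x), select_nonzero(d, x))

# step 3: OLS of y on d and the union of selected controls,
# with heteroskedasticity-robust (sandwich) standard errors
dat <- data.frame(y = y, d = d, x[, controls, drop = FALSE])
fit <- lm(y ~ ., data = dat)
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC1"))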