Solved – How to obtain Confidence Intervals for a LASSO regression

confidence interval, glmnet, lasso, regression

I'm very new to R. I have this code for a LASSO regression:

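library(glmnet)

# Predictors and binary outcome (read.csv2 expects ";" separators and "," decimals)
X <- as.matrix(read.csv2("DB_LASSO_ERP.csv"))
y <- read.csv2("OUTCOME_LASSO_ERP.csv", header = FALSE)$V1
# LASSO (alpha = 1) logistic regression over the whole lambda path
fit <- glmnet(x = X, y = y, family = "binomial", alpha = 1)
# Pick the penalty by cross-validation, then refit at that single lambda
crossval <- cv.glmnet(x = X, y = y, family = "binomial", alpha = 1)
penalty <- crossval$lambda.min
fit1 <- glmnet(x = X, y = y, family = "binomial", alpha = 1, lambda = penalty)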

I want to obtain confidence intervals for these coefficients. How can I do that? Could you help me with the script, please? I have very little experience with R.
Thanks!

Best Answer

Please think very carefully about why you want confidence intervals for the LASSO coefficients and how you will interpret them. This is not an easy problem.

The predictors chosen by LASSO (as for any feature-selection method) can be highly dependent on the data sample at hand. You can examine this in your own data by repeating your LASSO model-building procedure on multiple bootstrap samples of the data. If you have predictors that are correlated with each other, the specific predictors chosen by LASSO are likely to differ among models based on the different bootstrap samples. So what do you mean by a confidence interval for a coefficient for a predictor, say predictor $x_1$, if $x_1$ wouldn't even have been chosen by LASSO if you had worked with a different sample from the same population?
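A minimal sketch of that bootstrap check follows. It uses simulated stand-in data, which is an assumption made so the example runs on its own; in practice you would use your own `X` and `y`. The names `B`, `selected`, and `selection_freq` are illustrative, not from the question.

```r
library(glmnet)

set.seed(1)
# Simulated stand-in data (assumption): 200 rows, 10 predictors,
# with x2 strongly correlated with x1; only x1 drives the outcome.
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.3)
y <- rbinom(n, 1, plogis(1.5 * X[, 1]))

B <- 50  # number of bootstrap resamples (kept small for speed)
selected <- matrix(FALSE, nrow = B, ncol = p,
                   dimnames = list(NULL, colnames(X)))

for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)            # resample rows with replacement
  cv <- cv.glmnet(X[idx, ], y[idx], family = "binomial", alpha = 1)
  beta <- as.matrix(coef(cv, s = "lambda.min"))[-1, 1]  # drop the intercept
  selected[b, ] <- beta != 0                  # which predictors were kept
}

# Proportion of resamples in which each predictor was selected
selection_freq <- colMeans(selected)
print(round(selection_freq, 2))
```

Predictors whose selection frequency is well below 1 are the ones for which a single-sample confidence interval is hardest to interpret.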

The quality of predictions from a LASSO model is typically of more interest than confidence intervals for the individual coefficients. Despite the instability in feature selection, LASSO-based models can be useful for prediction. The selection of one predictor from among several correlated predictors might be somewhat arbitrary, but the one selected serves as a rough proxy for the others and thus can lead to valid predictions. You can test the performance of your LASSO approach by seeing how well the models based on multiple bootstrapped samples work on the full original data set.
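One way to sketch that performance check is shown here, again on simulated stand-in data (an assumption). Each model is built on a bootstrap sample and then scored on the full data set. The AUC helper `auc()` is a hypothetical base-R function written for this example, and AUC is just one possible performance measure.

```r
library(glmnet)

set.seed(2)
# Simulated stand-in data (assumption); replace with your own X and y
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(1.5 * X[, 1] - X[, 3]))

# AUC via the rank (Mann-Whitney) formula, base R only
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

B <- 30  # number of bootstrap resamples (kept small for speed)
auc_full <- numeric(B)

for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)
  # Repeat the whole model-building procedure on the bootstrap sample
  cv <- cv.glmnet(X[idx, ], y[idx], family = "binomial", alpha = 1)
  # Score the resulting model on the full original data
  prob <- predict(cv, newx = X, s = "lambda.min", type = "response")
  auc_full[b] <- auc(as.numeric(prob), y)
}

summary(auc_full)
```

The spread of `auc_full` across resamples gives a rough picture of how stable the predictive performance is, even when the selected predictors change.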

That said, there is recent work on principled ways to obtain confidence intervals and on related issues in inference after LASSO. This page and its links are a good place to start. The issues are discussed in more detail in Section 6.3 of Statistical Learning with Sparsity. There is also an R package, selectiveInference, that implements these methods. But these methods rest on specific assumptions that might not hold in your data. If you do choose this approach, make sure you understand the conditions under which it is valid and exactly what those confidence intervals really mean. That statistical issue, rather than the R coding issue, is what is crucial here.
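For orientation only, here is a sketch of how `fixedLassoInf()` from selectiveInference might be called for a logistic LASSO. It closely follows the example in that function's documentation and uses simulated data and a fixed `lambda` (both assumptions). Note that the function works with a penalty on the un-averaged scale, so the glmnet coefficients are extracted at `lambda / n`, and for the binomial family the coefficient vector passed in includes the intercept.

```r
library(glmnet)
library(selectiveInference)

set.seed(43)
# Simulated stand-in data (assumption): 2 true signals among 10 predictors
n <- 50; p <- 10
x <- scale(matrix(rnorm(n * p), n, p), TRUE, TRUE)  # standardize columns
beta_true <- c(3, 2, rep(0, p - 2))
y <- as.numeric(x %*% beta_true + rnorm(n))
y <- 1 * (y > mean(y))                              # binary outcome

# Fit the LASSO path; x is already standardized
gfit <- glmnet(x, y, standardize = FALSE, family = "binomial")

# Coefficients at a fixed penalty; glmnet's lambda is on the 1/n scale
lambda <- 0.8
beta_hat <- coef(gfit, x = x, y = y, s = lambda / n, exact = TRUE)

# Selective p-values and intervals, conditional on the selected model
out <- fixedLassoInf(x, y, beta_hat, lambda, family = "binomial")
print(out)
```

The intervals in `out$ci` refer only to the predictors listed in `out$vars`, and they are conditional on that particular model having been selected, which is exactly the interpretive point to be careful about.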