Solved – Testing coefficients for significance in Lasso logistic regression

lasso, logistic-regression, coefficients, selective-inference, statistical-significance

[A similar question was asked here with no answers]

I have fit a logistic regression model with L1 regularization (Lasso logistic regression) and I would like to test the fitted coefficients for significance and get their p-values. I know Wald tests (for instance) are an option for testing the significance of individual coefficients in a full regression without regularization, but with the Lasso I think further problems arise which do not allow the usual Wald formulas to be applied. For instance, the variance estimates needed for the test do not follow the usual expressions. The original Lasso paper:

Regression Shrinkage and Selection via the Lasso

suggests a bootstrap-based procedure to estimate the coefficient variances, which (again, I think) may be needed for the tests (Section 2.5, last paragraph of page 272 and beginning of page 273):

One approach is via the bootstrap: either $t$ can be fixed or we may optimize over $t$ for each bootstrap sample. Fixing $t$ is analogous to selecting the best subset (of features) and then using the least squares standard error for that subset

What I understand is: fit the Lasso repeatedly to the whole dataset until the optimal value of the regularization parameter is found (this step is not part of the bootstrap), then use only the features selected by the Lasso to fit OLS regressions to bootstrap resamples of the data, and apply the usual formulas to compute the variances from each of those regressions. (And then what should I do with all those per-resample variances to get the final variance estimate of each coefficient?)
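Here is my attempt at the "fix $t$" variant in R with glmnet (a minimal sketch, assuming a design matrix `x` and a 0/1 outcome `y` that I am not showing here). I refit the lasso itself with lambda held fixed on each resample, and I pool by taking the empirical standard deviation of each coefficient across resamples; is this what the paragraph means?

```r
library(glmnet)

set.seed(1)

## Not part of the bootstrap: choose lambda once on the full data.
cvfit  <- cv.glmnet(x, y, family = "binomial", alpha = 1)
lambda <- cvfit$lambda.min

B     <- 500                                             # bootstrap resamples
betas <- matrix(NA_real_, nrow = B, ncol = ncol(x) + 1)  # +1 for the intercept

for (b in 1:B) {
  idx <- sample(nrow(x), replace = TRUE)          # resample rows with replacement
  fit <- glmnet(x[idx, ], y[idx], family = "binomial",
                alpha = 1, lambda = lambda)       # refit with lambda held fixed
  betas[b, ] <- as.numeric(coef(fit))
}

## Pool across resamples: one standard error estimate per coefficient.
boot_se <- apply(betas, 2, sd)
```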

Furthermore, is it correct to use the usual significance tests (for instance the Wald test, which makes use of the estimated betas and their variances) with the Lasso estimates of the coefficients and the bootstrap-estimated variances? I am fairly sure it is not, but any help (use a different test, use a more straightforward approach, whatever…) is more than welcome.
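Continuing the sketch above, the naive Wald-style construction I am asking about would look like this (again, just to make the question concrete, not because I believe it is valid):

```r
## Naive Wald statistics: full-data lasso estimates over bootstrap SEs.
## Coefficients that are zero in every resample give boot_se = 0 and NaN here.
full  <- glmnet(x, y, family = "binomial", alpha = 1, lambda = lambda)
bhat  <- as.numeric(coef(full))
z     <- bhat / boot_se
pvals <- 2 * pnorm(-abs(z))
```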

According to the answers here I suspect inference and p-values just cannot be obtained. In my case, p-values are an external requirement (although the use of L1 regularization was my choice).

Thanks a lot

EDIT
What if I fit an ordinary (unpenalized) logistic regression using only the variables selected by a previous run of the Lasso logistic regression? Apparently (see here),

There's no need to run the model again after doing cross-validation (you just get the coefficients from the output of cv.glmnet), and in fact if you fit the new logistic regression model without penalisation then you're defeating the purpose of using lasso

But what if I do this with the sole purpose of being able to compute p-values while keeping the number of variables low? Is it a very dirty approach? 🙂
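To be explicit, the two-step procedure I have in mind would be something like this (R with glmnet; `x` and `y` as before; the `summary()` p-values of course ignore the selection step entirely):

```r
library(glmnet)

## Step 1: let the lasso pick the variables.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
b     <- coef(cvfit, s = "lambda.min")
keep  <- setdiff(rownames(b)[as.numeric(b) != 0], "(Intercept)")

## Step 2: refit an unpenalized logistic regression on the selected columns.
refit <- glm(y ~ ., family = binomial,
             data = data.frame(y = y, x[, keep, drop = FALSE]))
summary(refit)  # Wald z-values and p-values, blind to the selection step
```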

Best Answer

The problem with using the usual significance tests is that they assume the model was specified before looking at the data: under the null, a variable has no relationship with the outcome. With the lasso, however, you start with a bunch of candidate variables, select the best ones using the data itself, and on top of that the betas are shrunk towards zero. So you cannot just plug the lasso estimates into the classical tests; the resulting p-values will be biased.

As far as I know, the bootstrap is not used here to get a variance estimate, but to get the probability that each variable is selected, and those selection probabilities are what play the role of your p-values. Check Hastie's free book, Statistical Learning with Sparsity; chapter 6 covers exactly this. Statistical Learning with Sparsity: The Lasso and Generalizations
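A rough sketch of what I mean, in R (assuming a matrix `x` and a 0/1 vector `y` as in the question). This is the "optimize over $t$ for each bootstrap sample" variant from the quote: lambda is re-tuned on every resample and we record which variables survive:

```r
library(glmnet)

set.seed(1)
B   <- 200
sel <- matrix(0, nrow = B, ncol = ncol(x))

for (b in 1:B) {
  idx <- sample(nrow(x), replace = TRUE)
  cvb <- cv.glmnet(x[idx, ], y[idx], family = "binomial", alpha = 1)
  bb  <- as.numeric(coef(cvb, s = "lambda.min"))[-1]  # drop the intercept
  sel[b, ] <- as.integer(bb != 0)                     # 1 if selected
}

## Selection frequency of each variable across resamples.
sel_prob <- setNames(colMeans(sel), colnames(x))
```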

Also check this paper for some other ways to get p-values from the lasso: High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi. There are probably more.
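For example, with the hdi package the de-sparsified lasso p-values come out of lasso.proj; if I remember the interface correctly it also accepts a binomial family (same `x` and `y` as above):

```r
library(hdi)

fit <- lasso.proj(x, y, family = "binomial")  # de-sparsified (debiased) lasso
fit$pval       # raw p-values, one per variable
fit$pval.corr  # multiplicity-corrected p-values
```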