Solved – Interpret and compare lasso models

glmnet, lasso, logistic regression

I'm using lasso logistic regression to identify important variables and make inferences. For that I use glmnet with repeated cross-validation to identify the best tuning parameter lambda.
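
In outline, the fitting step looks something like this (a minimal sketch, not my actual code; x_controls, x_full, and y are placeholders for my design matrices and the 0/1 outcome):

    library(glmnet)

    # Sketch: repeat cv.glmnet over several random fold assignments and
    # average the minimum CV binomial deviance and the corresponding lambda.
    # (Averaging lambda.min over repeats is a simplification.)
    repeated_cv <- function(x, y, n_rep = 20) {
      res <- sapply(seq_len(n_rep), function(i) {
        set.seed(i)
        cv <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance")
        c(cv_error = min(cv$cvm), lambda = cv$lambda.min)
      })
      rowMeans(res)
    }

    repeated_cv(x_controls, y)   # controls only
    repeated_cv(x_full, y)       # controls + additional predictors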

In the first step I build a model including only my control variables.
In the second step I add the other predictors and compare the selected variables, fit, and performance to those of the first model. Below are the measures for the two models.

controls (CV error | % deviance): 1.320194 | 15.26

full model (CV error | % deviance): 1.3705 | 14.97

I identify a best lambda for each of the two models (0.03 | 0.09). The first column is the mean cross-validation error, measured as binomial deviance. The second column is the % of deviance explained relative to the intercept-only model when the model is fit to the whole data set.
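
For concreteness, the two numbers are taken from the cv.glmnet object roughly as follows (a sketch; object names are the placeholders from above):

    # Sketch: extract the two reported quantities from a cv.glmnet fit.
    cvfit <- cv.glmnet(x_full, y, family = "binomial", type.measure = "deviance")

    # Column 1: mean cross-validated binomial deviance at the selected lambda
    cv_error <- cvfit$cvm[cvfit$lambda == cvfit$lambda.min]

    # Column 2: % of null deviance explained by the whole-data fit at that
    # lambda (glmnet stores this as dev.ratio)
    pct_dev <- 100 * cvfit$glmnet.fit$dev.ratio[cvfit$lambda == cvfit$lambda.min]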

As far as I know, the first column tells me something about the predictive accuracy of the models, and the second column tells me something about the performance improvement over the null model (comparable to R^2 in linear regression?).
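
If I understand the glmnet documentation right, the quantity it reports as dev.ratio is

    D^2 = 1 - \frac{\mathrm{Dev}_{\text{model}}}{\mathrm{Dev}_{\text{null}}},

i.e. the fraction of the null deviance explained by the model, which is what I mean by the R^2 analogy.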

As one can see, my full model doesn't perform better than the controls-only model, even though additional variables are selected. Even when I build another model in which the controls are left unpenalized so they all stay in the model, the lasso selects one additional variable.
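
(By "don't penalize the controls" I mean glmnet's penalty.factor argument; roughly, assuming the controls are the first columns of x_full:)

    # Sketch: give the control columns penalty 0 so the lasso cannot drop them;
    # the remaining predictors keep the standard penalty of 1.
    k  <- ncol(x_controls)                        # number of control variables
    pf <- c(rep(0, k), rep(1, ncol(x_full) - k))  # assumes controls come first in x_full
    cv_unpen <- cv.glmnet(x_full, y, family = "binomial",
                          type.measure = "deviance", penalty.factor = pf)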

What would you conclude?

Why is there no improvement in performance?

Are the predictors useful even though they don't improve the model?

Best Answer

Looking only at the CV error (I don't think a training-error figure like your "% deviance" is very helpful, especially for a penalized model like the lasso), I would conclude that, at least around your sample size, the predictors beyond the control variables aren't helpful for prediction. Adding them to the model only worsened predictive accuracy. So, they aren't useful.

Why they're not helpful for prediction is of course hard to guess at without knowing the context of this problem.

By the way, the right way to compare your models' predictive accuracy to that of a trivial model is not to look at training error, but to compute CV error for the trivial model.
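
A minimal sketch of that, assuming the outcome y is coded 0/1 (the fold setup here is purely illustrative):

    # K-fold CV binomial deviance for the intercept-only (trivial) model,
    # on the same scale as cv.glmnet's cvm with type.measure = "deviance".
    set.seed(1)
    K     <- 10
    folds <- sample(rep(1:K, length.out = length(y)))
    dev_per_fold <- sapply(1:K, function(k) {
      p_hat <- mean(y[folds != k])   # intercept-only "fit" on the training folds
      yk    <- y[folds == k]
      -2 * mean(yk * log(p_hat) + (1 - yk) * log(1 - p_hat))
    })
    mean(dev_per_fold)   # compare with the CV errors of the two lasso models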