Solved – How to present results of a Lasso using glmnet

I would like to find predictors for a continuous dependent variable out of a set of 30 independent variables. I am using Lasso regression as implemented in the glmnet package in R. Here is some dummy code:

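# generate a dummy dataset with 30 predictors (10 useful & 20 useless)
set.seed(1)                                 # make the example reproducible
y  <- rnorm(100)
x1 <- matrix(rnorm(100 * 20), 100, 20)      # 20 pure-noise predictors
x2 <- matrix(y + rnorm(100 * 10), 100, 10)  # 10 predictors correlated with y
x  <- cbind(x1, x2)

# use cross-validation to find the best lambda
library(glmnet)
alpha <- 1                                  # alpha = 1 gives the lasso penalty
cv <- cv.glmnet(x, y, alpha = alpha, nfolds = 10)
l  <- cv$lambda.min

# fit the model and extract the coefficients at the chosen lambda
fits <- glmnet(x, y, family = "gaussian", alpha = alpha, nlambda = 100)
res  <- predict(fits, s = l, type = "coefficients")
res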

My question is how to interpret the output:

  • Is it correct to say that in the final output all predictors that show a coefficient different from zero are related to the dependent variable?

  • Would that be a sufficient report in the context of a journal publication? Or is it expected to provide test statistics for the significance of the coefficients? (The context is human genetics.)

  • Is it reasonable to calculate p-values or another test statistic to claim significance? How would that be possible? Is such a procedure implemented in R?

  • Would a simple regression plot (data points plotted with a linear fit) for every predictor be a suitable way to visualize this data?

  • Maybe someone can provide some easy examples of published articles showing the use of Lasso in the context of some real data & how to report this in a journal?

Best Answer

My understanding is that you can't necessarily say much about which variables are "important" or have "real" effects based on whether their coefficients are nonzero. To give an extreme example, if you have two predictors that are perfectly collinear, the lasso will pick one of them essentially at random to get the full weight and the other one will get zero weight.
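Here is a minimal sketch of that extreme case on simulated data. The variable names (x1, a, b) and the penalty value s = 0.1 are made up for illustration and are not from the question.

```r
library(glmnet)

set.seed(42)
n  <- 100
x1 <- rnorm(n)
x  <- cbind(a = x1, b = x1)   # two identical, perfectly collinear predictors
y  <- 2 * x1 + rnorm(n)       # y depends on the shared signal

# Lasso fit (alpha = 1); read off the coefficients at an arbitrary penalty.
fit <- glmnet(x, y, family = "gaussian", alpha = 1)
b   <- as.numeric(coef(fit, s = 0.1))[-1]   # drop the intercept
names(b) <- colnames(x)
print(b)

# Same data with the columns swapped, to see whether the weight
# follows the variable name or the column position.
fit_swapped <- glmnet(x[, c("b", "a")], y, family = "gaussian", alpha = 1)
b_swapped   <- as.numeric(coef(fit_swapped, s = 0.1))[-1]
names(b_swapped) <- c("b", "a")
print(b_swapped)
```

In runs like this, typically one of the two coefficients is exactly zero even though both columns carry identical information. Which one gets the weight is an artifact of the fitting procedure, so a zero coefficient here says nothing about that predictor being unrelated to y.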

This paper, which includes one of the authors of glmnet, presents some glmnet-based analyses (see especially the Introduction, Sections 2.3 and 4.3, and Tables 4 and 5). Glancing through, it looks like they didn't calculate P-values directly from the glmnet model. They did calculate two different kinds of P-values using other methods, but it doesn't look like they fully trust either of them.
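To give a feel for what "P-values from another method" can look like, here is a rough sketch of one generic approach, sample splitting: select variables with the lasso on one half of the data, then fit ordinary least squares on the other half. This is not necessarily what the paper did. The data are simulated and all names (idx, sel, held, ols) are made up for illustration.

```r
library(glmnet)

set.seed(1)
n <- 200
x <- matrix(rnorm(n * 30), n, 30)
colnames(x) <- paste0("x", 1:30)
# Assumed truth for the simulation: only x1, x2 and x3 affect y.
y <- as.numeric(x[, 1:3] %*% c(1, -1, 0.5) + rnorm(n))

# Half of the data is used for selection, the other half for inference.
idx <- sample(n, n / 2)

# Step 1: lasso with a cross-validated lambda on the selection half.
cv  <- cv.glmnet(x[idx, ], y[idx], alpha = 1, nfolds = 10)
b   <- as.numeric(coef(cv, s = "lambda.min"))[-1]   # drop the intercept
sel <- which(b != 0)                                # selected column indices

# Step 2: ordinary least squares on the held-out half, using only the
# selected predictors, to get conventional standard errors and P-values.
held <- data.frame(y = y[-idx], x[-idx, sel, drop = FALSE])
ols  <- lm(y ~ ., data = held)
print(summary(ols)$coefficients)
```

The P-values are only interpretable because the second half played no part in the selection, and the result depends on the random split, so this kind of output should be reported with that caveat.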

I'm not 100% sure what you're suggesting in terms of plotting methods, but I think it sounds reasonable.

Hope that helps.