Solved – When using glmnet how to report p-value significance to claim significance of predictors

glmnetlassomultiple regressionr

I have a large set of predictors (more than 43,000) for predicting a dependent variable which can take 2 values (0 or 1). The number of observations is more than 45,000. Most of the predictors are unigrams, bigrams and trigrams of words, so there is high degree of collinearity among them. There is a lot of sparsity in my dataset as well. I am using the logistic regression from the glmnet package, which works for the kind of dataset I have. My problem is how can I report p-value significance of the predictors. I do get the beta coefficient, but is there a way to claim that the beta coefficients are statistically significant?

Here is my code:

library('glmnet')
data <- read.csv('datafile.csv', header=T)
mat = as.matrix(data)
X = mat[,1:ncol(mat)-1] 
y = mat[,ncol(mat)]
fit <- cv.glmnet(X,y, family="binomial")

Another question is:
I am using the default alpha=1, lasso penalty which causes the additional problem that if two predictors are collinear the lasso will pick one of them at random and assign zero beta weight to the other. I also tried with ridge penalty (alpha=0) which assigns similar coefficients to highly correlated variables rather than selecting one of them. However, the model with lasso penalty gives me a much lower deviance than the one with ridge penalty. Is there any other way that I can report both predictors which are highly collinear?

Best Answer

There is a new paper, A Signiﬁcance Test for the Lasso, including the inventor of LASSO as an author that reports results on this problem. This is a relatively new area of research, so the references in the paper cover a lot of what is known at this point.

As for your second question, have you tried $\alpha \in (0,1)$? Often there is a value in this middle range that achieves a good compromise. This is called Elastic Net regularization. Since you are using cv.glmnet, you will probably want to cross-validate over a grid of $(\lambda, \alpha)$ values.

Related Solutions

Solved – How to report most important predictors using glmnet

set alpha = 0 in cv.glmnet() to use ridge instead of lasso.

"It is known that the ridge penalty shrinks the coefficients of correlated predictors towards each other while the lasso tends to pick one of them and discard the others." glmnet manual

You are already sampling the data by using cv.glmnet() (as opposed to simply using glmnet())
It is my understanding that for each lambda, you have a model. So lambda.min is the lambda-value for the model with the lowest error.

User Jason has example code posted in another question, that I believe will help: https://stats.stackexchange.com/a/92167

Solved – How to interpret glmnet

Here's an unintuitive fact - you're not actually supposed to give glmnet a single value of lambda. From the documentation here:

Do not supply a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warms starts for speed, and its often faster to ﬁt a whole path than compute a single ﬁt.

cv.glmnet will help you choose lambda, as you alluded to in your examples. The authors of the glmnet package suggest cv$lambda.1se instead of cv$lambda.min, but in practice I've had success with the latter.

After running cv.glmnet, you don't have to rerun glmnet! Every lambda in the grid (cv$lambda) has already been run. This technique is called "Warm Start" and you can read more about it here. Paraphrasing from the introduction, the Warm Start technique reduces running time of iterative methods by using the solution of a different optimization problem (e.g., glmnet with a larger lambda) as the starting value for a later optimization problem (e.g., glmnet with a smaller lambda).

To extract the desired run from cv.glmnet.fit, try this:

small.lambda.index <- which(cv$lambda == cv$lambda.min)
small.lambda.betas <- cv$glmnet.fit$beta[, small.lambda.index]

Revision (1/28/2017)

No need to hack to the glmnet object like I did above; take @alex23lemm's advice below and pass the s = "lambda.min", s = "lambda.1se" or some other number (e.g., s = .007) to both coef and predict. Note that your coefficients and predictions depend on this value which is set by cross validation. Use a seed for reproducibility! And don't forget that if you don't supply an "s" in coef and predict, you'll be using the default of s = "lambda.1se". I have warmed up to that default after seeing it work better in a small data situation. s = "lambda.1se" also tends to provide more regularization, so if you're working with alpha > 0, it will also tend towards a more parsimonious model. You can also choose a numerical value of s with the help of plot.glmnet to get to somewhere in between (just don't forget to exponentiate the values from the x axis!).

Best Answer

Related Solutions

Solved – How to report most important predictors using glmnet

Solved – How to interpret glmnet

Related Question