Solved – How to report most important predictors using glmnet

glmnet | lasso | logistic | r | regression

I want to find the most important predictors of a binomial dependent variable out of a set of more than 43,000 independent variables (these form the columns of my input dataset). The number of observations is more than 45,000 (these form the rows). Most of the independent variables are unigrams, bigrams and trigrams of words, so there is a high degree of collinearity among them, and the dataset is very sparse. I am using logistic regression from the glmnet package, which handles this kind of dataset. Here is some code:

library('glmnet')
data <- read.csv('datafile.csv', header = TRUE)
mat <- as.matrix(data)
X <- mat[, -ncol(mat)]    # predictors: every column except the last
                          # (the original 1:ncol(mat)-1 indexes 0:(ncol(mat)-1), a precedence trap)
y <- mat[, ncol(mat)]     # response: the last column
fit <- cv.glmnet(X, y, family = "binomial", type.measure = "class")
# betas at the last (smallest) lambda on the path, not at lambda.min
betacoeff <- as.matrix(fit$glmnet.fit$beta[, ncol(fit$glmnet.fit$beta)])
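
Since the dataset is mostly zeros, glmnet can also take a sparse matrix directly, which avoids building the dense matrix above; a minimal sketch using the Matrix package (which glmnet already depends on):

library(Matrix)
Xs <- Matrix(X, sparse = TRUE)   # dgCMatrix: stores only the non-zero entries
fit <- cv.glmnet(Xs, y, family = "binomial", type.measure = "class")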

betacoeff contains the betas for all the independent variables. I am thinking of reporting the predictors with the 50 largest absolute betas as the most important predictors.
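For illustration, pulling out the 50 predictors with the largest absolute coefficients at lambda.min could look like this sketch (the names b, ord and top50 are just for this example):

b <- coef(fit, s = "lambda.min")       # sparse (p+1) x 1 matrix, intercept first
b <- b[-1, , drop = FALSE]             # drop the intercept row
ord <- order(abs(b[, 1]), decreasing = TRUE)
top50 <- data.frame(predictor = rownames(b)[ord[1:50]],
                    beta = b[ord[1:50], 1])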
My questions are:

  1. The lasso used by glmnet tends to pick one good predictor out of a group of highly correlated good predictors, so I am not sure how much I can rely on the betas returned by the above model run.

  2. Should I manually resample the data (say 10 times), run the above model each time, get the list of predictors with the top betas, and then keep those that appear in all 10 repetitions? Is there any standard way of doing this? What is the standard way of sampling in this case?

  3. My last question is about cvm (the cross-validated error) returned by the above model. Since I use type.measure = "class", cvm gives the misclassification error for each value of lambda. How do I report the misclassification error for the final model? Is it the cvm corresponding to lambda.min?

Best Answer

  1. Set alpha = 0 in cv.glmnet() to use ridge instead of the lasso.

"It is known that the ridge penalty shrinks the coefficients of correlated predictors towards each other while the lasso tends to pick one of them and discard the others." glmnet manual

  2. You are already resampling the data by using cv.glmnet() (as opposed to simply using glmnet()), since cross-validation refits the model on different subsets of the rows.
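
That said, if you want the repeated-sampling check described in question 2, one informal sketch (the bootstrap scheme and the 10 repetitions are arbitrary choices, not a standard named procedure) is to count how often each predictor is selected across refits:

set.seed(1)
reps <- 10
selected <- matrix(0, nrow = ncol(X), ncol = reps)
for (i in 1:reps) {
  idx <- sample(nrow(X), replace = TRUE)            # bootstrap rows
  cvfit <- cv.glmnet(X[idx, ], y[idx], family = "binomial",
                     type.measure = "class")
  selected[, i] <- as.numeric(coef(cvfit, s = "lambda.min")[-1, 1] != 0)
}
freq <- setNames(rowMeans(selected), colnames(X))   # selection frequency
head(sort(freq, decreasing = TRUE), 50)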

  3. It is my understanding that for each lambda you have a model, so lambda.min is the lambda value whose model has the lowest mean cross-validated error. Report the cvm entry corresponding to lambda.min.
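
Pulling that number out of the fitted object is a one-liner:

# mean cross-validated misclassification error at lambda.min
fit$cvm[fit$lambda == fit$lambda.min]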

User Jason posted example code in another answer that I believe will help: https://stats.stackexchange.com/a/92167