Solved – Coefficient value from glmnet

glmnet, standardization

I am running glmnet for the first time and I am getting some weird results.

My dataset has n = 139 and p = 70 (the predictors are correlated).

I am trying to estimate the effect of each variable, for both inference and prediction.
I am running:

> cvfit = cv.glmnet(X, Y, family = "gaussian", alpha = 0.5, intercept = TRUE, standardize = TRUE, nlambda = 100, type.measure = "mse")

> coef(cvfit, s = "lambda.min")

Of the 70 estimates, two caught my attention:

4           0.5731999

14          5.419356829

What bugs me is the fact that:

> cor(X[,4],Y)

[1,] 0.674714

> cor(X[,14],Y)

[1,] -0.01742419

In addition, if I standardize X myself (using scale(X)) and run it again:

> cvfit = cv.glmnet(scale(X), Y, family = "gaussian", alpha = 0.5, intercept = TRUE, standardize = FALSE, nlambda = 100, type.measure = "mse")

> coef(cvfit, s = "lambda.min")

I now get that variable 4 has the largest effect and variable 14 is about five times smaller. I couldn't find a good description of the standardization process in glmnet. Any clue as to why this is happening? (I don't think it's a bug; I just would like to understand why, and which set of coefficients is the right one to report.)

PS: I ran this many times, so I know it is not an artifact of the random fold assignment during cross-validation.
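For reference, here is a minimal sketch of how I compare the two calls on identical folds. foldid is a standard cv.glmnet argument; the seed value is arbitrary and only there for reproducibility:

    library(glmnet)

    set.seed(1)                                   # arbitrary seed
    # Fixing foldid keeps the CV folds identical across both fits,
    # so any difference in coefficients cannot come from fold assignment.
    foldid <- sample(rep(1:10, length.out = nrow(X)))

    fit_internal <- cv.glmnet(X, Y, family = "gaussian", alpha = 0.5,
                              standardize = TRUE,  foldid = foldid,
                              type.measure = "mse")
    fit_scaled   <- cv.glmnet(scale(X), Y, family = "gaussian", alpha = 0.5,
                              standardize = FALSE, foldid = foldid,
                              type.measure = "mse")

    # Coefficients side by side at lambda.min
    cbind(internal = as.vector(coef(fit_internal, s = "lambda.min")),
          scaled   = as.vector(coef(fit_scaled,   s = "lambda.min")))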

Best Answer

I tracked down the standardization process of glmnet and documented it on the Thinklab platform. That write-up includes a comparison of the different ways to use standardization with glmnet.

Long story short: if you let glmnet do the standardization (by relying on the default standardize = TRUE), glmnet standardizes the predictors behind the scenes but then back-transforms the results, so everything it reports, including the plots, is "de-standardized", i.e. expressed in the coefficients' natural units. When you pass scale(X) yourself with standardize = FALSE, the coefficients stay on the standardized (per-SD) scale, which is why the two sets of numbers look so different even though they describe essentially the same fit.
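As a minimal sketch of what that implies for the fit in the question (assuming the objects cvfit, X, and Y from above): you can put the standardize = TRUE coefficients back on the per-SD scale by multiplying each slope by its predictor's standard deviation, and then compare them to the scale(X) fit.

    # cvfit was fit with standardize = TRUE, so its coefficients are in the
    # predictors' original units; multiplying each slope by its predictor's
    # standard deviation expresses it as an effect per one SD of that predictor.
    b_orig <- as.vector(coef(cvfit, s = "lambda.min"))[-1]   # drop the intercept
    b_std  <- b_orig * apply(X, 2, sd)                       # effect per SD of x_j

    # b_std should now be roughly comparable to the coefficients from the
    # scale(X) fit; small discrepancies can remain from differing variance
    # conventions in glmnet's internal standardization and from CV noise.
    head(cbind(original_scale = b_orig, per_SD = b_std))

On this per-SD scale, variable 4 should again come out ahead of variable 14, consistent with what you see when you scale X yourself.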