I am running glmnet for the first time and I am getting some weird results.
My dataset has n = 139 and p = 70 (correlated variables).
I am trying to estimate the effect of each variable, for both inference and prediction.
I am running:
> cvfit = cv.glmnet(X, Y, family = "gaussian", alpha = 0.5, intercept = TRUE, standardize = TRUE, nlambda = 100, type.measure = "mse")
> coef(cvfit, s = "lambda.min")
Of the 70 estimates, two caught my attention:
4 0.5731999
14 5.419356829
What bugs me is the fact that:
> cor(X[,4],Y)
[1,] 0.674714
> cor(X[,14],Y)
[1,] -0.01742419
In addition, if I standardize X myself (using scale(X)) and run it again:
> cvfit = cv.glmnet(scale(X), Y, family = "gaussian", alpha = 0.5, intercept = TRUE, standardize = FALSE, nlambda = 100, type.measure = "mse")
> coef(cvfit, s = "lambda.min")
I now get that variable 4 has the largest effect and variable 14 is about 5 times smaller. I couldn't find a good description of the normalization process in glmnet. Any clue as to why this is happening? I don't think it's a bug; I would just like to understand why, and which result is right.
PS: I ran this many times, so I know it is not an effect of the sampling during the cross-validation.
Best Answer
I tracked down the standardization process of glmnet and documented it on the Thinklab platform. This includes a comparison of the different ways to use standardization with glmnet.
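Long story short, if you let glmnet standardize the predictors (by relying on the default standardize = TRUE), glmnet performs the standardization behind the scenes and reports everything, including the plots, the "de-standardized" way, in the coefficients' natural metrics. By contrast, if you pass scale(X) with standardize = FALSE, the coefficients you get back are on the standardized scale, so the two sets of estimates are not directly comparable.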
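To make the difference between the two metrics concrete, here is a minimal sketch in base R. It uses ordinary least squares via lm() rather than glmnet, and simulated data rather than the X and Y from the question, so the variable names and numbers are purely illustrative assumptions. The point is only the rescaling: a coefficient on the natural scale equals the coefficient on the standardized scale divided by that predictor's standard deviation, so a predictor with a very small spread can show a large natural-scale coefficient even when its standardized effect is small.

```r
# Simulated, hypothetical data (NOT the asker's X and Y)
set.seed(1)
n  <- 139
x1 <- rnorm(n, sd = 10)    # predictor with a large spread
x2 <- rnorm(n, sd = 0.05)  # predictor with a tiny spread
y  <- 0.5 * x1 + 5 * x2 + rnorm(n)

X <- cbind(x1 = x1, x2 = x2)

# Fit on the natural scale of the predictors (drop the intercept)
b_raw <- coef(lm(y ~ X))[-1]

# Fit on predictors standardized with scale(), which uses the n - 1 formula
Xs    <- scale(X)
b_std <- coef(lm(y ~ Xs))[-1]

# Convert back: natural-scale coefficient = standardized coefficient / sd
sds    <- attr(Xs, "scaled:scale")
b_back <- b_std / sds

print(rbind(raw = b_raw, standardized = b_std, back_transformed = b_back))
```

In this sketch the "raw" and "back_transformed" rows agree, while the "standardized" row ranks the two predictors very differently. To compare effect sizes across variables, look at the standardized coefficients; to predict from unscaled data, use the natural-metric ones.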