Solved – r: coefficients from glmnet and caret are different for the same lambda

caretglmnetrregression coefficients

I've read a few Q&As about this, but am still not sure I understand, why the coefficients from glmnet and caret models based on the same sample and the same hyper-parameters are slightly different. Would greatly appreciate an explanation!

I am using caret to train a ridge regression:

library(ISLR)
Hitters = na.omit(Hitters)
x = model.matrix(Salary ~ ., Hitters)[, -1] #Dropping the intercept column.
y = Hitters$Salary

set.seed(0)
train = sample(1:nrow(x), 7*nrow(x)/10)

library(caret)
set.seed(0)
train_control = trainControl(method = 'cv', number = 10)
grid = 10 ^ seq(5, -2, length = 100)
tune.grid = expand.grid(lambda = grid, alpha = 0)
ridge.caret = train(x[train, ], y[train],
                    method = 'glmnet',
                    trControl = train_control,
                    tuneGrid = tune.grid)
ridge.caret$bestTune
# alpha is 0 and best lambda is 242.0128

Now, I use the lambda (and alpha) found above to train a ridge regression for the whole data set. At the end, I extract the coefficients:

ridge_full <- train(x, y,
                    method = 'glmnet',
                    trControl = trainControl(method = 'none'), 
                    tuneGrid = expand.grid(
                      lambda = ridge.caret$bestTune$lambda, alpha = 0)
                    )
coef(ridge_full$finalModel, s = ridge.caret$bestTune$lambda)

Finally, using exactly the same alpha and lambda, I try to fit the same ridge regression using glmnet package – and extract coefficients:

library(glmnet)
ridge_full2 = glmnet(x, y, alpha = 0, lambda = ridge.caret$bestTune$lambda)
coef(ridge_full2)

Best Answer

It seems like a bug in caret's implementation.

First some notes about glmnet package:

  • The documentation of glmnet() recommends against giving one single value for lambda. It is preferable to warm start with large values values of lambda.
  • predict.glmnet() lets you override the value(s) of lambda which was used to train the model (cf s argument). However, when some supplied values of s differ from the original lambda, the default behaviour is to use linear interpolation rather than re-fit a model with lambda=s. This can be controlled with the exact argument of predict().

Caret provides a loop() function for its glmnet wrapper (see https://github.com/topepo/caret/blob/master/models/files/glmnet.R). This loop function will only fit one model per value of alpha, with the max corresponding value of lambda. The other lambda values are considered as "submodels". Their evaluation is deferred to predict() rather than fit(), using the glmnet::predict(s=...) feature mentioned above. This is efficient and fine but predict should specify exact=TRUE to obtain the same results as glmnet.

With this setting, I don't think that ridge.caret$finalModel was effectively trained with the optimal lambda.

Related Question