I've read a few Q&As about this, but am still not sure I understand, why the coefficients from glmnet and caret models based on the same sample and the same hyper-parameters are slightly different. Would greatly appreciate an explanation!
I am using caret to train a ridge regression:
library(ISLR)
Hitters = na.omit(Hitters)
x = model.matrix(Salary ~ ., Hitters)[, -1] #Dropping the intercept column.
y = Hitters$Salary
set.seed(0)
train = sample(1:nrow(x), 7*nrow(x)/10)
library(caret)
set.seed(0)
train_control = trainControl(method = 'cv', number = 10)
grid = 10 ^ seq(5, -2, length = 100)
tune.grid = expand.grid(lambda = grid, alpha = 0)
ridge.caret = train(x[train, ], y[train],
method = 'glmnet',
trControl = train_control,
tuneGrid = tune.grid)
ridge.caret$bestTune
# alpha is 0 and best lambda is 242.0128
Now, I use the lambda (and alpha) found above to train a ridge regression for the whole data set. At the end, I extract the coefficients:
ridge_full <- train(x, y,
method = 'glmnet',
trControl = trainControl(method = 'none'),
tuneGrid = expand.grid(
lambda = ridge.caret$bestTune$lambda, alpha = 0)
)
coef(ridge_full$finalModel, s = ridge.caret$bestTune$lambda)
Finally, using exactly the same alpha and lambda, I try to fit the same ridge regression using glmnet package – and extract coefficients:
library(glmnet)
ridge_full2 = glmnet(x, y, alpha = 0, lambda = ridge.caret$bestTune$lambda)
coef(ridge_full2)
Best Answer
It seems like a bug in
caret
's implementation.First some notes about glmnet package:
glmnet()
recommends against giving one single value forlambda
. It is preferable to warm start with large values values oflambda
.predict.glmnet()
lets you override the value(s) oflambda
which was used to train the model (cfs
argument). However, when some supplied values ofs
differ from the originallambda
, the default behaviour is to use linear interpolation rather than re-fit a model withlambda=s
. This can be controlled with theexact
argument of predict().Caret provides a
loop()
function for its glmnet wrapper (see https://github.com/topepo/caret/blob/master/models/files/glmnet.R). This loop function will only fit one model per value of alpha, with the max corresponding value of lambda. The other lambda values are considered as "submodels". Their evaluation is deferred topredict()
rather thanfit()
, using theglmnet::predict(s=...)
feature mentioned above. This is efficient and fine but predict should specifyexact=TRUE
to obtain the same results as glmnet.With this setting, I don't think that
ridge.caret$finalModel
was effectively trained with the optimal lambda.