For these models, the importance values are the regression coefficients for the final model. Larger coefficients are associated with larger effects. Using scale = FALSE is good here so you can also get the signs.
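A minimal sketch, assuming lassoFit1 is the caret train() object used later in this thread:
library(caret)
# keep the raw coefficient values (and their signs) instead of rescaling to 0-100
varImp(lassoFit1, scale = FALSE)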
There are always pitfalls with these measures, depending on how you want to define importance. They don't measure lack of fit at all, so if your model is only 51% accurate, they are not very reflective of the data. In the case of regression coefficients, main effects are misleading when interactions are present, and so on.
As for correlation between predictors, Friedman et al. (2010, JSS) state:
Ridge regression is known to shrink the coefficients of correlated predictors towards each other, allowing them to borrow strength from each other. In the extreme case of $k$ identical predictors, they each get identical coefficients with $1/k$th the size that any single one would get if fit alone. [...]
Lasso, on the other hand, is somewhat indifferent to very correlated predictors, and will tend to pick one and ignore the rest.
We have a pretty good example of that in Section 6.4 of APM.
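A minimal sketch of that behaviour on made-up data (the variable names are hypothetical): two identical predictors share the ridge coefficient, while the lasso tends to keep one and drop the other. A single lambda is used purely for illustration.
library(glmnet)
set.seed(1)
x1 = rnorm(100)
x = cbind(x1 = x1, x2 = x1, x3 = rnorm(100)) # x1 and x2 are identical
y = 2 * x1 + rnorm(100)
coef(glmnet(x, y, alpha = 0, lambda = 0.1)) # ridge: x1 and x2 get near-equal, half-sized coefficients
coef(glmnet(x, y, alpha = 1, lambda = 0.1)) # lasso: typically one of the pair is zeroed out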
Max
If you check the lambdas in the final model against the best lambda obtained from caret, you will see that the best lambda is not present in the model's lambda sequence:
lassoFit1$bestTune$lambda
[1] 0.01545996
lassoFit1$bestTune$lambda %in% lassoFit1$finalModel$lambda
[1] FALSE
If you do:
coef(lassoFit1$finalModel, lassoFit1$bestTune$lambda)
8 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -4.532659e-15
Population 1.493984e-01
Income .
Illiteracy .
Murder -7.929823e-01
HS.Grad 2.669362e-01
Frost -1.979238e-01
Area .
it will give you the coefficients from the tested lambda that is closest to your best-tune lambda. You can of course re-fit the model with your specified lambda and alpha:
fit = glmnet(x = statedata[, c(1:3,5,6,7,8)], y = statedata[, 4],
lambda = lassoFit1$bestTune$lambda, alpha = lassoFit1$bestTune$alpha)
> fit$beta
7 x 1 sparse Matrix of class "dgCMatrix"
s0
Population 0.1493747
Income .
Illiteracy .
Murder -0.7929223
HS.Grad 0.2669745
Frost -0.1979134
Area .
which, as you can see, is close enough to the first approximation.
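Alternatively, coef() for glmnet objects takes exact = TRUE to refit at the exact lambda instead of interpolating; note that recent glmnet versions then require the original data to be re-supplied, so this is a sketch assuming statedata as above:
coef(lassoFit1$finalModel, s = lassoFit1$bestTune$lambda, exact = TRUE,
x = as.matrix(statedata[, c(1:3,5,6,7,8)]), y = statedata[, 4])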
To answer your other questions:
I get the coefficients. Is this the best model?
You did coef(cvfit, s="lambda.min"), which is the lambda with the smallest cross-validation error. If you read the glmnet paper, the authors go with Breiman's one-standard-error rule (see this for a complete view), as it selects a less complicated model. You might want to consider using coef(cvfit, s="lambda.1se") instead.
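For instance, with cvfit from cv.glmnet as in your question:
c(min = cvfit$lambda.min, one.se = cvfit$lambda.1se) # lambda.1se is always >= lambda.min
coef(cvfit, s = "lambda.1se") # usually a sparser model than s = "lambda.min"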
It looks like glmnet does test more lambdas in the cross-validation, is that true? Does caret or glmnet lead to a better model?

By default cv.glmnet tests a defined number of lambdas (in this example it is 67), but you can specify more by passing lambda=<your set of lambdas to test>. You should get similar values using caret or cv.glmnet, but note that you cannot vary alpha with cv.glmnet().
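A sketch of both points, assuming x and y hold your predictors and response: supply your own lambda grid, and loop over alpha yourself, fixing foldid so every fit uses the same folds and the CV errors stay comparable.
grid = 10^seq(1, -3, length.out = 100) # your set of lambdas to test
foldid = sample(rep(1:10, length.out = nrow(x))) # same folds for every alpha
fits = lapply(c(0, 0.5, 1), function(a)
  cv.glmnet(x, y, alpha = a, lambda = grid, foldid = foldid))
sapply(fits, function(f) min(f$cvm)) # pick the alpha with the lowest CV error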
How do I manage to extract the best final model from caret and glmnet and plug it into a Cox proportional hazards model, for example?

I guess you want to take the non-zero coefficients, and you can do this with:
#exclude intercept
res = coef(cvfit, s="lambda.1se")[-1,]
names(res)[which(res!=0)]
[1] "Murder" "HS.Grad"
Best Answer
train does tune over both. Basically, you only need alpha when training and can get predictions across different values of lambda using predict.glmnet. Maybe a value of lambda = "all" or something else would be more informative.