How do I properly train and test a LASSO model using glmnet in R?
- Specifically, how should I do this when the lack of an external test data set means I must use cross-validation (or a similar approach) to test my LASSO model?
Let me break down my scenario:
I only have one data set to inform and train my glmnet model. As a result, I'll have to use cross-validation to split up my data and also generate a way to test my model.
I'm already using cv.glmnet, which according to the package details:

Does k-fold cross-validation for glmnet, produces a plot, and returns a value for lambda.
- Is the cross-validation performed in cv.glmnet simply to pick the best lambda, or is it also serving as a more general cross-validation procedure?
- In other words, do I still need to do another cross-validation step to "test" my model? I'm working with the assumption that, "yes I do." That being the case, how do I approach cross-validating my cv.glmnet model?
- Do I have to do so manually, or is perhaps the caret package useful for glmnet models?
- Do I use two concentric "loops" of cross-validation? That is, do I use an "inner loop" of CV via cv.glmnet to determine the best lambda value within each of the k folds of an "external loop" of k-fold cross-validation?
- If I do cross-validation of my already cross-validating cv.glmnet model, how do I isolate the "best" model (from the "best" lambda value) from each cv.glmnet model within each fold of my otherwise "external loop" of cross-validation?
  - Note: I'm defining the "best" model as the model associated with a lambda that produces an MSE within 1 SE of the minimum … this is the lambda.1se in the cv.glmnet model.
Context:
I'm trying to predict tree age ("Age") based on tree diameter ("D"), D^2, and species ("factor(SPEC)") [resulting equation: Age ~ D + factor(SPEC) + D^2]. I have ~50K rows of data, but the data is longitudinal (it tracks individuals through time) and consists of ~65 species.
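For reference, glmnet takes a numeric matrix rather than a formula, so the equation above would first be expanded with model.matrix(). A minimal sketch with simulated data (the column names Age, D, and SPEC are taken from the post; the values are made up):

```r
# Simulated stand-in for the real tree data (values are made up).
set.seed(1)
dat <- data.frame(
  Age  = rpois(200, 40),
  D    = runif(200, 5, 80),
  SPEC = factor(sample(paste0("sp", 1:5), 200, replace = TRUE))
)

# model.matrix expands factor(SPEC) into dummy columns and I(D^2) into
# the squared term; drop the intercept, since glmnet adds its own.
x <- model.matrix(Age ~ D + I(D^2) + SPEC, data = dat)[, -1]
y <- dat$Age
```

The resulting x and y are what cv.glmnet expects as its first two arguments.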
Best Answer
Is the cross-validation performed in cv.glmnet simply to pick the best lambda, or is it also serving as a more general cross-validation procedure?
It does almost everything needed in a cross-validation: it fits the model over a grid of candidate lambda values, estimates the cross-validated error for each, and finally refits the model on all the data with the chosen parameters. In the returned object, for example:

- cvm is the mean cross-validated error.
- cvsd is its estimated standard deviation.

Like the other returned values, these are calculated on the held-out folds. Finally, glmnet.fit gives the model trained on all the data (training + test) with the best parameters.

Do I have to do so manually, or is perhaps the caret package useful for glmnet models?
You need not do this manually. caret would be very useful, and is one of my favourite packages because it works with all the other models using the same syntax. I myself often use caret rather than cv.glmnet. However, in your scenario they are essentially the same.

Do I use two concentric "loops" of cross validation?... Do I use an "inner loop" of CV via cv.glmnet to determine the best lambda value within each of k folds of an "external loop" of k-fold cross validation processing?
You could do this; the concept is very similar to the idea of nested cross-validation (see Nested cross validation for model selection).
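As a concrete illustration of the caret route mentioned above, here is a minimal sketch on simulated data (the tuning grid is an illustrative choice; the caret and glmnet packages are assumed to be installed):

```r
library(caret)

# Simulated data: only the first two predictors matter.
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
colnames(x) <- paste0("x", 1:5)
y <- as.vector(x %*% c(2, -1, 0, 0, 0) + rnorm(100))

# train() runs the resampling (10-fold CV here) while tuning over the
# grid; alpha = 1 restricts the elastic net to the LASSO.
fit <- train(
  x, y,
  method    = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(alpha = 1,
                          lambda = 10^seq(-3, 0, length.out = 20))
)

fit$bestTune  # the (alpha, lambda) pair with the best cross-validated RMSE
```

predict(fit, newdata = ...) then uses the model refit on all the data at the winning lambda.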
If I do cross-validation of my already cross-validating cv.glmnet model, how do I isolate the "best" model (from the "best" lambda value) from each cv.glmnet model within each fold of my otherwise "external loop" of cross validation?
Just run a loop where you generate training and test data, run cv.glmnet on the training data, and use the fitted model (glmnet.fit) to predict on the test data.
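That loop might look like the following sketch (simulated data; the fold assignment and the MSE metric are illustrative choices, not from the original answer):

```r
library(glmnet)

# Simulated data: only the first two predictors matter.
set.seed(1)
x <- matrix(rnorm(500 * 5), 500, 5)
y <- as.vector(x %*% c(2, -1, 0, 0, 0) + rnorm(500))

k <- 5
fold <- sample(rep(1:k, length.out = nrow(x)))  # external-loop fold labels
outer_mse <- numeric(k)

for (i in 1:k) {
  train_x <- x[fold != i, , drop = FALSE]; train_y <- y[fold != i]
  test_x  <- x[fold == i, , drop = FALSE]; test_y  <- y[fold == i]

  # Inner loop: cv.glmnet picks lambda using only the training folds.
  cv <- cv.glmnet(train_x, train_y, nfolds = 10)

  # The "best" model under the 1-SE rule, evaluated on data the inner
  # CV never saw; predict() on a cv.glmnet object uses the stored
  # glmnet.fit, so nothing is refit here.
  preds <- predict(cv, newx = test_x, s = "lambda.1se")
  outer_mse[i] <- mean((test_y - preds)^2)
}

mean(outer_mse)  # honest estimate of out-of-sample error
```

The mean over the external folds estimates how the whole procedure (including lambda selection) generalizes.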