Solved – How to do cross-validation with cv.glmnet (LASSO regression in R)

Tags: cross-validation, glmnet, lasso, r

I'm wondering how to properly train and test a LASSO model using glmnet in R.

  • Specifically, I'm wondering how to do so when the lack of an external test data set necessitates that I use cross-validation (or a similar approach) to test my LASSO model.

Let me break down my scenario:

I only have one data set to inform and train my glmnet model. As a result, I'll have to use cross-validation to split up my data so that I also have a way to test my model.

I'm already using cv.glmnet, which according to the package details:

Does k-fold cross-validation for glmnet, produces a plot, and returns a value for lambda.

  • Is the cross-validation performed in cv.glmnet simply to pick the best lambda, or is it also serving as a more general cross-validation procedure?

    • In other words, do I still need to do another cross-validation step to "test" my model?

I'm working under the assumption that "yes, I do."

That being the case, how do I approach cross-validating my cv.glmnet model?

  • Do I have to do so manually, or is perhaps the caret package useful for glmnet models?

  • Do I use two concentric "loops" of cross-validation? That is, do I use an "inner loop" of CV via cv.glmnet to determine the best lambda value within each of the k folds of an "external loop" of k-fold cross-validation?

    • If I do cross-validation of my already cross-validating cv.glmnet model, how do I isolate the "best" model (i.e., the one at the "best" lambda value) from each cv.glmnet model within each fold of my "external loop" of cross-validation?

      • Note: I'm defining the "best" model as the one associated with the lambda that produces an MSE within 1 SE of the minimum; this is the lambda.1se value in the cv.glmnet object.

Context:

I'm trying to predict tree age ("age") based on tree diameter ("D"), D^2, and species ("factor(SPEC)"); the resulting formula is Age ~ D + factor(SPEC) + D^2. I have ~50K rows of data, but the data is longitudinal (it tracks individuals through time) and covers ~65 species.
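For concreteness, here is a minimal sketch of how that setup might look with glmnet. The data frame name trees and its column names are assumptions based on the description above; glmnet expects a numeric matrix, so the formula is expanded with model.matrix():

    library(glmnet)

    # Hypothetical data frame `trees` with columns age, D, and SPEC
    x <- model.matrix(age ~ D + I(D^2) + factor(SPEC), data = trees)[, -1]  # drop intercept column
    y <- trees$age

    cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # alpha = 1 gives the LASSO
    plot(cvfit)        # CV error curve over the lambda sequence
    cvfit$lambda.1se   # largest lambda with MSE within 1 SE of the minimum

Because the data are longitudinal, you may also want to keep all rows belonging to one individual in the same fold; cv.glmnet supports that through its foldid argument.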

Best Answer

Is the cross-validation performed in cv.glmnet simply to pick the best lambda, or is it also serving as a more general cross-validation procedure?

It does almost everything needed in a cross-validation: it fits the model across a sequence of candidate lambda values, estimates the cross-validated error at each one, chooses the best lambda, and finally refits the model on the full data with those parameters.

For example, in the returned object:

cvm is the mean cross-validated error and cvsd is its estimated standard error. Like the other cross-validation outputs, these are computed on the held-out folds. Finally, glmnet.fit gives the model refit on all of the data (training and held-out folds together), from which you can extract the fit at the best lambda.
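As a sketch (reusing the x and y built above), these pieces live here in the returned object:

    cvfit <- cv.glmnet(x, y, alpha = 1)

    cvfit$cvm          # mean cross-validated error, one entry per lambda
    cvfit$cvsd         # estimated standard error of cvm
    cvfit$lambda.min   # lambda minimizing cvm
    cvfit$lambda.1se   # largest lambda with cvm within 1 SE of the minimum
    cvfit$glmnet.fit   # the glmnet model refit on the full data set

    # Coefficients and predictions at the 1-SE lambda:
    coef(cvfit, s = "lambda.1se")
    predict(cvfit, newx = x, s = "lambda.1se")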

Do I have to do so manually, or is perhaps the caret package useful for glmnet models?

You need not do this manually. caret would be very useful; it is one of my favourite packages because it works with many other models using the same syntax. I myself often use caret rather than cv.glmnet, although in your scenario the two are essentially equivalent (a sketch follows).
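Here is a sketch of the equivalent tuning with caret; the lambda grid below is an assumption (in practice you might reuse the lambda sequence that cv.glmnet chooses):

    library(caret)

    fit <- train(
      x, y,
      method    = "glmnet",
      trControl = trainControl(method = "cv", number = 10),
      tuneGrid  = expand.grid(alpha  = 1,   # LASSO only
                              lambda = 10^seq(-3, 1, length.out = 50))
    )
    fit$bestTune   # the (alpha, lambda) pair chosen by cross-validation

By default caret selects the lambda with the lowest CV error; passing selectionFunction = "oneSE" to trainControl mimics the lambda.1se rule.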

Do I use two concentric "loops" of cross-validation? That is, do I use an "inner loop" of CV via cv.glmnet to determine the best lambda value within each of the k folds of an "external loop" of k-fold cross-validation?

You could do this; the concept is essentially nested cross-validation (see "Nested cross validation for model selection").

If I do cross-validation of my already cross-validating cv.glmnet model, how do I isolate the "best" model (from the "best" lambda value) from each cv.glmnet model within each fold of my otherwise "external loop" of cross validation?

Just run a loop: in each iteration, split the data into a training set and a test set, run cv.glmnet on the training set, and use the resulting fit to predict on the test set, as in the sketch below.
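A minimal sketch of that nested loop, assuming the x and y matrices from earlier. The outer folds estimate out-of-sample error, while cv.glmnet supplies the inner CV that picks lambda.1se within each outer training set:

    set.seed(1)
    k     <- 5
    folds <- sample(rep(1:k, length.out = nrow(x)))  # outer fold assignments
    mse   <- numeric(k)

    for (i in 1:k) {
      train_idx <- folds != i
      inner <- cv.glmnet(x[train_idx, ], y[train_idx], alpha = 1)  # inner CV picks lambda
      preds <- predict(inner, newx = x[!train_idx, ], s = "lambda.1se")
      mse[i] <- mean((y[!train_idx] - preds)^2)
    }

    mean(mse)  # nested-CV estimate of out-of-sample MSE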