Solved – How to do cross-validation with cv.glmnet (LASSO regression in R)

Tags: cross-validation, glmnet, lasso, r

I'm wondering how to properly train and test a LASSO model using glmnet in R.

  • Specifically, I'm wondering how to do so when the lack of an external test data set necessitates that I use cross-validation (or a similar approach) to test my LASSO model.

Let me break down my scenario:

I only have one data set to inform and train my glmnet model. As a result, I'll have to use cross-validation to split up my data so that I also have a way to test my model.

I'm already using cv.glmnet, which according to the package details:

Does k-fold cross-validation for glmnet, produces a plot, and returns a value for lambda.

  • Is the cross-validation performed in cv.glmnet simply to pick the best lambda, or is it also serving as a more general cross-validation procedure?

    • In other words, do I still need to do another cross-validation step to "test" my model?

I'm working under the assumption that "yes, I do."

That being the case, how do I approach cross-validating my cv.glmnet model?

  • Do I have to do so manually, or is perhaps the caret package useful for glmnet models?

  • Do I use two concentric "loops" of cross-validation? That is, do I use an "inner loop" of CV via cv.glmnet to determine the best lambda value within each of the k folds of an "external loop" of k-fold cross-validation?

    • If I do cross-validation of my already cross-validating cv.glmnet model, how do I isolate the "best" model (i.e., the one at the "best" lambda value) from each cv.glmnet model within each fold of my "external loop" of cross-validation?

      • Note: I'm defining the "best" model as the one associated with the lambda that produces an MSE within 1 SE of the minimum; this is the lambda.1se value in the cv.glmnet object.

Context:

I'm trying to predict tree age ("age") based on tree diameter ("D"), D^2, and species ("factor(SPEC)"); the resulting formula is Age ~ D + factor(SPEC) + D^2. I have ~50K rows of data, but the data is longitudinal (it tracks individuals through time) and covers ~65 species.
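For concreteness, here is a minimal sketch of how that setup might look with glmnet. The data frame name trees and its column names are assumptions based on the description above; glmnet expects a numeric matrix, so the formula is expanded with model.matrix():

    library(glmnet)

    # Hypothetical data frame `trees` with columns age, D, and SPEC
    x <- model.matrix(age ~ D + I(D^2) + factor(SPEC), data = trees)[, -1]  # drop intercept column
    y <- trees$age

    cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # alpha = 1 gives the LASSO
    plot(cvfit)        # CV error curve over the lambda sequence
    cvfit$lambda.1se   # largest lambda with MSE within 1 SE of the minimum

Because the data are longitudinal, you may also want to keep all rows belonging to one individual in the same fold; cv.glmnet supports that through its foldid argument.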

Best Answer

Is the cross-validation performed in cv.glmnet simply to pick the best lambda, or is it also serving as a more general cross-validation procedure?

It does almost everything needed in a cross-validation: it fits the model across a sequence of candidate lambda values, estimates the cross-validated error at each one, chooses the best lambda, and finally refits the model on the full data with those parameters.

For example, in the returned object:

cvm is the mean cross-validated error and cvsd is its estimated standard error. Like the other cross-validation outputs, these are computed on the held-out folds. Finally, glmnet.fit gives the model refit on all of the data (training and held-out folds together), from which you can extract the fit at the best lambda.
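As a sketch (reusing the x and y built above), these pieces live here in the returned object:

    cvfit <- cv.glmnet(x, y, alpha = 1)

    cvfit$cvm          # mean cross-validated error, one entry per lambda
    cvfit$cvsd         # estimated standard error of cvm
    cvfit$lambda.min   # lambda minimizing cvm
    cvfit$lambda.1se   # largest lambda with cvm within 1 SE of the minimum
    cvfit$glmnet.fit   # the glmnet model refit on the full data set

    # Coefficients and predictions at the 1-SE lambda:
    coef(cvfit, s = "lambda.1se")
    predict(cvfit, newx = x, s = "lambda.1se")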

Do I have to do so manually, or is perhaps the caret package useful for glmnet models?

You need not do this manually. caret would be very useful; it is one of my favourite packages because it works with many other models using the same syntax. I myself often use caret rather than cv.glmnet, although in your scenario the two are essentially equivalent (a sketch follows).
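Here is a sketch of the equivalent tuning with caret; the lambda grid below is an assumption (in practice you might reuse the lambda sequence that cv.glmnet chooses):

    library(caret)

    fit <- train(
      x, y,
      method    = "glmnet",
      trControl = trainControl(method = "cv", number = 10),
      tuneGrid  = expand.grid(alpha  = 1,   # LASSO only
                              lambda = 10^seq(-3, 1, length.out = 50))
    )
    fit$bestTune   # the (alpha, lambda) pair chosen by cross-validation

By default caret selects the lambda with the lowest CV error; passing selectionFunction = "oneSE" to trainControl mimics the lambda.1se rule.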

Do I use two concentric "loops" of cross-validation? That is, do I use an "inner loop" of CV via cv.glmnet to determine the best lambda value within each of the k folds of an "external loop" of k-fold cross-validation?

You could do this; the concept is essentially nested cross-validation (see "Nested cross validation for model selection").

If I do cross-validation of my already cross-validating cv.glmnet model, how do I isolate the "best" model (from the "best" lambda value) from each cv.glmnet model within each fold of my otherwise "external loop" of cross validation?

Just run a loop: in each iteration, split the data into a training set and a test set, run cv.glmnet on the training set, and use the resulting fit to predict on the test set, as in the sketch below.
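A minimal sketch of that nested loop, assuming the x and y matrices from earlier. The outer folds estimate out-of-sample error, while cv.glmnet supplies the inner CV that picks lambda.1se within each outer training set:

    set.seed(1)
    k     <- 5
    folds <- sample(rep(1:k, length.out = nrow(x)))  # outer fold assignments
    mse   <- numeric(k)

    for (i in 1:k) {
      train_idx <- folds != i
      inner <- cv.glmnet(x[train_idx, ], y[train_idx], alpha = 1)  # inner CV picks lambda
      preds <- predict(inner, newx = x[!train_idx, ], s = "lambda.1se")
      mse[i] <- mean((y[!train_idx] - preds)^2)
    }

    mean(mse)  # nested-CV estimate of out-of-sample MSE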