Solved – Cross validation after LASSO in complex survey data

Tags: cross-validation, glmnet, lasso, survey

I am trying to do model selection among a set of candidate predictors using the LASSO with a continuous outcome. The goal is to select the model with the best prediction performance, which is usually done by K-fold cross-validation after obtaining a solution path for the tuning parameter from the LASSO. The issue is that the data come from a complex multi-stage survey design (NHANES), with stratification and cluster sampling. The estimation part is not hard, since glmnet in R accepts sampling weights. But the cross-validation part is less clear to me: the observations are no longer i.i.d., and how can the procedure account for sampling weights that represent a finite population?

So my questions are:

1) How can I carry out K-fold cross-validation with complex survey data to select the optimal tuning parameter? More specifically, how should I partition the sample into training and validation sets (see the sketch after these questions for the kind of split I have in mind), and how should the estimate of prediction error be defined?

2) Is there an alternative way to select the optimal tuning parameter?
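For concreteness, here is the kind of partition I have been considering: assign whole PSUs to folds, so that clusters are never split between training and validation sets. This is purely illustrative; `nhanes`, `SDMVSTRA`, and `SDMVPSU` are stand-ins for the actual data frame and design variables, and the scheme does nothing special with the strata or the weights, which is part of what I am unsure about.

```r
# Illustrative fold assignment only: keep whole PSUs (clusters) together.
# Assumed data frame 'nhanes' with SDMVSTRA (stratum) and SDMVPSU (PSU) columns.
set.seed(1)
K <- 5
psus <- unique(nhanes[, c("SDMVSTRA", "SDMVPSU")])           # one row per PSU
psus$fold <- sample(rep_len(seq_len(K), nrow(psus)))         # random fold per PSU
nhanes <- merge(nhanes, psus, by = c("SDMVSTRA", "SDMVPSU")) # attach fold labels
table(nhanes$fold)                                           # check fold sizes
```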

Best Answer

I don't have a detailed answer, just some pointers to work I've been meaning to read:

You could take a look at McConville (2011) on complex-survey LASSO, to be sure your use of the LASSO is appropriate for your data. But maybe it's not a big deal if you're using the LASSO only for variable selection and then fitting something else to the selected variables.
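If you do go that route, a minimal sketch of "LASSO for selection, design-based model for the refit" might look like the following. Everything here is an assumption about your setup: `nhanes` is the analysis data frame, `y` and `x1`–`x5` are the outcome and candidate predictors, and `SDMVSTRA`/`SDMVPSU`/`WTMEC2YR` are the NHANES stratum, PSU, and weight variables; the refit uses `svyglm()` from the survey package.

```r
library(glmnet)
library(survey)

## Assumed data frame 'nhanes' with outcome y, candidate predictors x1..x5,
## and design variables SDMVSTRA (stratum), SDMVPSU (PSU), WTMEC2YR (weight).
x <- model.matrix(y ~ x1 + x2 + x3 + x4 + x5, data = nhanes)[, -1]

## LASSO with the sampling weights (the weights rescale the loss but do not
## capture the clustering or stratification)
fit <- glmnet(x, nhanes$y, weights = nhanes$WTMEC2YR)

## Variables with nonzero coefficients at a chosen lambda
lam  <- 0.05                                   # placeholder; pick by CV
cf   <- coef(fit, s = lam)
keep <- setdiff(rownames(cf)[as.numeric(cf) != 0], "(Intercept)")

## Refit only the selected variables with a design-based linear model
des   <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                   weights = ~WTMEC2YR, nest = TRUE, data = nhanes)
refit <- svyglm(reformulate(keep, response = "y"), design = des)
summary(refit)
```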

For cross-validation with complex survey data (though not LASSO), McConville also cites Opsomer & Miller (2005) and You (2009). But their methods seem to use leave-one-out CV, not K-fold.

Leave-one-out should be simpler to implement with complex surveys, since there is less concern about how to partition the data appropriately. (On the other hand, it can take much longer to run than K-fold. And if your goal is model selection, leave-one-out is known to perform worse than K-fold in large samples.)
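If leave-one-out is the direction you take, one possible sketch of a survey-weighted version for choosing lambda is below: reuse the lambda path from a full-data fit, refit with each observation held out, and score the held-out prediction with that observation's sampling weight. This is only an assumption about how the pieces might be combined; it ignores the clustering entirely and will be slow on NHANES-sized samples (`x`, `y`, and `w` are the assumed predictor matrix, outcome, and weights).

```r
library(glmnet)

## Assumed objects: predictor matrix x, numeric outcome y, sampling weights w.
full_fit <- glmnet(x, y, weights = w)
lambdas  <- full_fit$lambda
n        <- nrow(x)

## Weighted squared error for each held-out observation, at every lambda
err <- matrix(NA_real_, nrow = n, ncol = length(lambdas))
for (i in seq_len(n)) {
  fit_i  <- glmnet(x[-i, ], y[-i], weights = w[-i], lambda = lambdas)
  pred_i <- predict(fit_i, newx = x[i, , drop = FALSE])
  err[i, ] <- w[i] * (y[i] - as.numeric(pred_i))^2
}

## Weighted leave-one-out estimate of prediction error; pick the best lambda
loo_mse     <- colSums(err) / sum(w)
best_lambda <- lambdas[which.min(loo_mse)]
```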