Solved – Cross validation after LASSO in complex survey data

Tags: cross-validation, glmnet, lasso, survey

I am trying to do model selection among a set of candidate predictors using the LASSO with a continuous outcome. The goal is to select the model with the best prediction performance, which is usually done by K-fold cross-validation after obtaining a solution path for the tuning parameter from the LASSO. The issue is that the data come from a complex multi-stage survey design (NHANES), with stratification and cluster sampling. The estimation part is not hard, since glmnet in R accepts sampling weights. But the cross-validation part is less clear to me: the observations are no longer i.i.d., and how can the procedure account for sampling weights that represent a finite population?

So my questions are:

1) How can I carry out K-fold cross-validation with complex survey data to select the optimal tuning parameter? More specifically, how should I partition the sample into training and validation sets (see the sketch after these questions for the kind of split I have in mind), and how should the estimate of prediction error be defined?

2) Is there an alternative way to select the optimal tuning parameter?
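For concreteness, here is the kind of partition I have been considering: assign whole PSUs to folds, so that clusters are never split between training and validation sets. This is purely illustrative; `nhanes`, `SDMVSTRA`, and `SDMVPSU` are stand-ins for the actual data frame and design variables, and the scheme does nothing special with the strata or the weights, which is part of what I am unsure about.

```r
# Illustrative fold assignment only: keep whole PSUs (clusters) together.
# Assumed data frame 'nhanes' with SDMVSTRA (stratum) and SDMVPSU (PSU) columns.
set.seed(1)
K <- 5
psus <- unique(nhanes[, c("SDMVSTRA", "SDMVPSU")])           # one row per PSU
psus$fold <- sample(rep_len(seq_len(K), nrow(psus)))         # random fold per PSU
nhanes <- merge(nhanes, psus, by = c("SDMVSTRA", "SDMVPSU")) # attach fold labels
table(nhanes$fold)                                           # check fold sizes
```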

Best Answer

I don't have a detailed answer, just some pointers to work I've been meaning to read:

You could take a look at McConville (2011) on complex-survey LASSO, to be sure your use of the LASSO is appropriate for your data. But maybe it's not a big deal if you're using the LASSO only for variable selection and then fitting something else to the selected variables.
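If you do go that route, a minimal sketch of "LASSO for selection, design-based model for the refit" might look like the following. Everything here is an assumption about your setup: `nhanes` is the analysis data frame, `y` and `x1`–`x5` are the outcome and candidate predictors, and `SDMVSTRA`/`SDMVPSU`/`WTMEC2YR` are the NHANES stratum, PSU, and weight variables; the refit uses `svyglm()` from the survey package.

```r
library(glmnet)
library(survey)

## Assumed data frame 'nhanes' with outcome y, candidate predictors x1..x5,
## and design variables SDMVSTRA (stratum), SDMVPSU (PSU), WTMEC2YR (weight).
x <- model.matrix(y ~ x1 + x2 + x3 + x4 + x5, data = nhanes)[, -1]

## LASSO with the sampling weights (the weights rescale the loss but do not
## capture the clustering or stratification)
fit <- glmnet(x, nhanes$y, weights = nhanes$WTMEC2YR)

## Variables with nonzero coefficients at a chosen lambda
lam  <- 0.05                                   # placeholder; pick by CV
cf   <- coef(fit, s = lam)
keep <- setdiff(rownames(cf)[as.numeric(cf) != 0], "(Intercept)")

## Refit only the selected variables with a design-based linear model
des   <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                   weights = ~WTMEC2YR, nest = TRUE, data = nhanes)
refit <- svyglm(reformulate(keep, response = "y"), design = des)
summary(refit)
```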

For cross-validation with complex survey data (though not LASSO), McConville also cites Opsomer & Miller (2005) and You (2009). But their methods seem to use leave-one-out CV, not K-fold.

Leave-one-out should be simpler to implement with complex surveys, since there is less concern about how to partition the data appropriately. (On the other hand, it can take much longer to run than K-fold. And if your goal is model selection, leave-one-out is known to perform worse than K-fold in large samples.)
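If leave-one-out is the direction you take, one possible sketch of a survey-weighted version for choosing lambda is below: reuse the lambda path from a full-data fit, refit with each observation held out, and score the held-out prediction with that observation's sampling weight. This is only an assumption about how the pieces might be combined; it ignores the clustering entirely and will be slow on NHANES-sized samples (`x`, `y`, and `w` are the assumed predictor matrix, outcome, and weights).

```r
library(glmnet)

## Assumed objects: predictor matrix x, numeric outcome y, sampling weights w.
full_fit <- glmnet(x, y, weights = w)
lambdas  <- full_fit$lambda
n        <- nrow(x)

## Weighted squared error for each held-out observation, at every lambda
err <- matrix(NA_real_, nrow = n, ncol = length(lambdas))
for (i in seq_len(n)) {
  fit_i  <- glmnet(x[-i, ], y[-i], weights = w[-i], lambda = lambdas)
  pred_i <- predict(fit_i, newx = x[i, , drop = FALSE])
  err[i, ] <- w[i] * (y[i] - as.numeric(pred_i))^2
}

## Weighted leave-one-out estimate of prediction error; pick the best lambda
loo_mse     <- colSums(err) / sum(w)
best_lambda <- lambdas[which.min(loo_mse)]
```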