Solved – How to ensure that the most appropriate value for lambda is chosen in lasso

lasso, r

My situation:

  • small sample size: 116
  • binary outcome variable
  • long list of explanatory variables: 50
  • the explanatory variables were not picked off the top of my head; their choice was based on the literature.

Following a suggestion to a previous question of mine, I have run LASSO (using R's glmnet package) in order to select the subset of explanatory variables that best explains variation in my binary outcome variable.

I have noticed that I get very different values of lambda.min from k-fold cross-validation (the cv.glmnet function) depending on the value I assign to k. I have tried the default (10) and 5. Which would be the most appropriate value for k, considering my sample size?
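For concreteness, the two runs I am comparing look something like this (a minimal sketch; x and y stand for my predictor matrix and binary outcome):

```r
library(glmnet)

set.seed(1)
cv5  <- cv.glmnet(x, y, family = "binomial", nfolds = 5)
cv10 <- cv.glmnet(x, y, family = "binomial", nfolds = 10)  # the default

cv5$lambda.min   # lambda minimising the CV deviance with k = 5
cv10$lambda.min  # ... and with k = 10; the two can differ noticeably
```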

In my specific case, is it necessary to repeat the cross-validation, say 100 times, in order to reduce randomness and allow averaging of the error curves, as suggested in this post? If so: I tried the code suggested in that post but got error messages; could anyone suggest better code?
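Something along these lines is what I have in mind (a sketch, not the exact code from that post: it fixes one lambda grid, reruns cv.glmnet with fresh fold assignments, and averages the error curves before picking lambda):

```r
library(glmnet)

# Common lambda grid, so the error curves are comparable across repetitions
lambdas <- glmnet(x, y, family = "binomial")$lambda

nreps <- 100
cvm <- sapply(seq_len(nreps), function(i) {
  cv.glmnet(x, y, family = "binomial", lambda = lambdas, nfolds = 10)$cvm
})

mean_cvm   <- rowMeans(cvm)                 # averaged error curve
lambda_avg <- lambdas[which.min(mean_cvm)]  # lambda.min of the averaged curve
```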

UPDATE1: I have managed to use the foldid option in cv.glmnet, as suggested in the comments below, by organizing my x-matrix in a way that all the 32 observations belonging to one of my outcome classes appears in lines 1-32 and by using the folowing code: foldid=c(sample(rep(seq(10),length=32),sample(rep(seq(10),length=84)). However, when I ran cv.glmnet, only one of the levels of a categorical variable with four levels was included in the model. So following a suggestion to a previous question of mine, I tried to run group-lasso using R's gglasso package. And now I am facing this issue.
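For the group-lasso step, what I am attempting looks roughly like this (a sketch; grp is a hypothetical integer vector mapping each column of x to its variable, so the four dummy columns of the categorical variable share one group index, and gglasso's logistic loss expects the outcome coded as -1/1 rather than 0/1):

```r
library(gglasso)

y2 <- ifelse(y == 1, 1, -1)  # recode 0/1 outcome to -1/1 for loss = "logit"

cvg <- cv.gglasso(x, y2, group = grp, loss = "logit",
                  pred.loss = "misclass", nfolds = 10)

cvg$lambda.min
coef(cvg, s = "lambda.min")  # grouped dummies enter or leave the model together
```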

Best Answer

10-fold cross-validation is often considered the gold standard because it strikes a compromise between bias and variance. If I understand correctly (statistics and machine learning are not my main field), if you go to a larger number of folds, your error estimate will depend strongly on the particular data you have. As a consequence, the error estimate will have high variance and low bias.

I would say that if you know another set of samples would show approximately the same values as yours (that is, if you have low variance), you can use a larger number of folds (even LOOCV). Otherwise, I would stay with 10-fold CV.
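If you want to see this trade-off on your own data, you could compare how much lambda.min moves around under each fold count (a rough sketch; x and y as in the question):

```r
library(glmnet)

# Repeat CV several times for a given fold count, collecting lambda.min
lam_for_k <- function(k, reps = 25) {
  replicate(reps, cv.glmnet(x, y, family = "binomial", nfolds = k)$lambda.min)
}

# Spread of log(lambda.min) across repetitions, for k = 5 and k = 10
sapply(list(k5 = 5, k10 = 10), function(k) sd(log(lam_for_k(k))))
```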