Solved – cv.glmnet lambda stability

glmnet, lasso, r

Intro

I am running cv.glmnet from the glmnet package in R: 10-fold cross-validation, repeated 100 times, on a dataset with 25,000 observations and 150 variables. I am using the parallel = TRUE option, as it greatly speeds things up.
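For concreteness, here is a minimal sketch of that setup; the predictor matrix x and response y are hypothetical placeholders, a gaussian family is assumed, and a doParallel backend is registered for the parallel = TRUE option. Adjust the family and the number of cores to your own data and machine.

    library(glmnet)
    library(doParallel)

    registerDoParallel(cores = 4)   # a registered backend is required for parallel = TRUE

    # Repeat 10-fold cross-validation 100 times (x and y are placeholders)
    cv_runs <- lapply(1:100, function(i) {
      cv.glmnet(x, y, nfolds = 10, family = "gaussian", parallel = TRUE)
    })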

Problem Description

  • I notice that the minimum lambda value (lambda.min) and the lambda value within one standard error of the minimum (lambda.1se) are exactly the same in 99.9% of the 100 cross-validation runs (one way to tabulate this is sketched after this list).
  • Each cross-validation run calculates exactly 25 lambdas, no more and no fewer. I have left the default settings as they are, which allow for up to 100 lambdas but stop the path early once the deviance stops changing sufficiently from one lambda to the next.
  • The cross-validation error varies minimally (by an average of 0.002).
  • If I run another few iterations of 100 10-fold cross-validations on a different split into training and test populations, I get slightly different values of lambda.min and lambda.1se than in the other iterations. The points above still hold, however, within a given iteration (same lambdas, same exact number of lambdas calculated per cross-validation run, minimal change in cross-validation error).
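One way to check these points, continuing from the cv_runs list sketched above, is to collect lambda.min, lambda.1se, the length of the lambda path, and the minimum cross-validation error for every run:

    # Summarise each cv.glmnet object in cv_runs (from the earlier sketch)
    summary_tab <- data.frame(
      lambda_min = sapply(cv_runs, `[[`, "lambda.min"),
      lambda_1se = sapply(cv_runs, `[[`, "lambda.1se"),
      n_lambda   = sapply(cv_runs, function(cv) length(cv$lambda)),
      min_cvm    = sapply(cv_runs, function(cv) min(cv$cvm))
    )

    table(summary_tab$lambda_min)   # how often each lambda.min value occurs
    table(summary_tab$n_lambda)     # number of lambdas calculated per run
    range(summary_tab$min_cvm)      # spread of the minimum CV error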

Final Question

Why are all my cross-validation results within a given iteration so similar? Is it perhaps because the folds in each cross-validation run end up being very similar?

Best Answer

The folds should be different with every iteration (as long as you don't fix the random seed).

I think there is nothing surprising here, since you have a lot of observations (assuming your data are not sparse) and relatively few variables. In this setting it is quite natural that, when you do 10-fold cross-validation, the relationships estimated between your variables on 90% of your dataset will match those estimated on the full dataset quite closely, and will therefore also be similar to each other.
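To illustrate the point, here is a quick sketch (again with hypothetical x and y, and an arbitrary fixed lambda value chosen purely for illustration) that compares lasso coefficients fitted on random 90% subsets with those fitted on the full data; at this sample size the differences are typically tiny:

    library(glmnet)

    lam       <- 0.01                                   # arbitrary illustrative lambda
    full_coef <- as.numeric(coef(glmnet(x, y), s = lam))

    # Refit on ten random 90% subsets and extract coefficients at the same lambda
    sub_coefs <- replicate(10, {
      idx <- sample(nrow(x), size = floor(0.9 * nrow(x)))
      as.numeric(coef(glmnet(x[idx, ], y[idx]), s = lam))
    })

    max(abs(sub_coefs - full_coef))   # largest deviation from the full-data coefficients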

Low cross-validation error is not a problem. However, when you fit the model you should check that you get a similar error when making predictions on a held-out test set (again assuming that the data were split into test and train randomly), as sketched below.
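A sketch of that check, with hypothetical objects x_train, y_train, x_test, y_test from a random train/test split, and a gaussian outcome assumed so that cvm is a mean squared error:

    library(glmnet)

    cv_fit <- cv.glmnet(x_train, y_train, nfolds = 10)

    # Predict on the held-out test set at the selected lambda
    pred     <- predict(cv_fit, newx = x_test, s = "lambda.min")
    test_mse <- mean((y_test - pred)^2)

    test_mse          # should be close to the cross-validation estimate below
    min(cv_fit$cvm)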
