GLMNet – Why Is cv.glmnet Giving a Lambda.min That Is Clearly Not the Lambda for Minimum Error?

glmnet, overfitting

I have a pool of possible predictors X for a response Y. In my case the number of predictors is far larger than the number of observations.

I have noticed in my runs of cv.glmnet (leave-one-out, all other parameters left at their defaults) that predicting with lambda.min simply returns the mean value of Y. If I run the prediction with a choice of lambda < lambda.min, it gives actual predictions, and these have a lower error than using the mean value of Y.

I'm not sure what's going on here. It's as if the code is defaulting to a dummy predictor (the mean response) for some reason, and this behavior seems to depend on the size of X.

Here's a simple example:

library(glmnet)

x = replicate(100, rnorm(10))          # 10 observations, 100 predictors
y = replicate(1, rnorm(10))            # pure-noise response, unrelated to x
cvfit = cv.glmnet(x, y, nfolds = 10)   # n = 10, so 10 folds is leave-one-out
ypred1 = predict(cvfit, newx = x, s = "lambda.min")

(In the case I just ran, this gives cvfit$lambda.min = 0.8453387, and every entry of ypred1 is the mean value of y. So let's choose a smaller lambda.)

ypred2 = predict(cvfit, newx = x, s = 0.1)

mse1 = mean((ypred1 - y)^2)   # 1.20 in my run
mse2 = mean((ypred2 - y)^2)   # 0.03 in my run

I understand that "newx=x" doesn't make sense for any real work, but I don't understand why it returns the predictions it does.
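
For what it's worth, inspecting the coefficients (same cvfit object as above; exact values vary from run to run) shows that at lambda.min every slope is shrunk exactly to zero, so all that remains is the intercept, which equals the mean of y:

coef(cvfit, s = "lambda.min")   # only the (Intercept) row is nonzero
mean(y)                         # matches the intercept, and every entry of ypred1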

Best Answer

Here, glmnet is working as intended! In your example there is no relationship between $x$ and $y$ (both were generated independently), so the "correct" thing to do is to always predict $\hat{y} = \bar{y}.$ Any method that isn't doing that is overfitting the training data, which in your check also serves as the test set, since you predict with newx = x.
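
You can read this off the cross-validation output directly. Below is a minimal sketch along the lines of the question's own example (the seed is arbitrary and the numbers will vary run to run): the in-sample MSE keeps improving as $\lambda$ shrinks, but the cross-validated error stored in cvfit$cvm is, by construction, smallest at lambda.min and grows again for smaller $\lambda$:

library(glmnet)

set.seed(1)
x = replicate(100, rnorm(10))
y = rnorm(10)

# n = 10 with 10 folds is leave-one-out; glmnet warns and enforces
# grouped = FALSE since there is only one observation per fold
cvfit = cv.glmnet(x, y, nfolds = 10)

min(cvfit$cvm)                          # CV error at lambda.min, the curve's minimum

i = which.min(abs(cvfit$lambda - 0.1))  # path value closest to the hand-picked s = 0.1
cvfit$cvm[i]                            # larger: smaller lambda does worse out of sample

plot(cvfit)                             # CV curve rises as lambda drops below lambda.min

The in-sample MSE at $s = 0.1$ looks better only because the model is scored on the same ten points it was fit to. Out of sample, those "actual predictions" are worse than simply predicting $\bar{y}$, and that is exactly what the cross-validation picked up.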