Lambda in Elastic Net Regression – Why ‘Within One Standard Error from the Minimum’ Is Recommended

Tags: cross-validation, elastic-net, glmnet, regression, regularization

I understand what role lambda plays in an elastic-net regression, and I understand why one would select lambda.min, the value of lambda that minimizes cross-validated error.

My question is: where in the statistics literature is it recommended to use lambda.1se, that is, the largest value of lambda whose cross-validated error is within one standard error of the minimum? I can't seem to find a formal citation, or even a reason why this is often a good value. I understand that it's a more restrictive regularization and will shrink the parameters further toward zero, but I'm not always certain of the conditions under which lambda.1se is a better choice than lambda.min. Can someone help explain?
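For context, glmnet reports both values directly from cv.glmnet. Here is a minimal sketch in R; the simulated data and the choice of alpha = 0.5 are arbitrary assumptions for illustration:

```r
library(glmnet)

set.seed(42)
x <- matrix(rnorm(100 * 20), nrow = 100)            # 100 observations, 20 predictors
y <- drop(x[, 1:3] %*% c(2, -1, 0.5)) + rnorm(100)  # sparse true signal plus noise

# Elastic net: alpha = 0.5 mixes the ridge and lasso penalties
cvfit <- cv.glmnet(x, y, alpha = 0.5, nfolds = 10)

cvfit$lambda.min  # lambda with the minimum mean cross-validated error
cvfit$lambda.1se  # largest lambda whose CV error is within 1 SE of that minimum

# lambda.1se typically yields the sparser fit:
coef(cvfit, s = "lambda.min")
coef(cvfit, s = "lambda.1se")
```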

Best Answer

Friedman, Hastie, and Tibshirani (2010), citing The Elements of Statistical Learning, write,

We often use the “one-standard-error” rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony.

The reason for using one standard error, as opposed to any other amount, seems to be that it's, well... standard. Krstajic et al. (2014) write (bold emphasis mine):

Breiman et al. [25] have found in the case of selecting optimal tree size for classification tree models that the tree size with minimal cross-validation error generates a model which generally overfits. Therefore, in Section 3.4.3 of their book Breiman et al. [25] define the one standard error rule (1 SE rule) for choosing an optimal tree size, and they implement it throughout the book. In order to calculate the standard error for single V-fold cross-validation, accuracy needs to be calculated for each fold, and the standard error is calculated from V accuracies from each fold. Hastie et al. [4] define the 1 SE rule as selecting the most parsimonious model whose error is no more than one standard error above the error of the best model, and they suggest in several places using the 1 SE rule for general cross-validation use. **The main point of the 1 SE rule, with which we agree, is to choose the simplest model whose accuracy is comparable with the best model.**

The suggestion is that the choice of one standard error is entirely heuristic, based on the sense that one standard error typically is not large relative to the range of $\lambda$ values.
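To make the mechanics concrete, here is a minimal sketch of the rule as Krstajic et al. describe it: compute per-fold errors for each candidate $\lambda$, take the standard error across folds at the minimizing $\lambda$, and select the most parsimonious model whose mean error stays within one standard error of the minimum. The names `one_se_lambda`, `cv_err`, and `lambdas` are hypothetical, and the lambdas are assumed sorted in decreasing order (strongest penalty first), as glmnet orders them:

```r
# A sketch of the 1 SE rule, independent of glmnet's built-in selection.
# cv_err:  V x K matrix of fold-level CV errors (V folds, K candidate lambdas)
# lambdas: length-K vector of candidate lambdas, sorted in decreasing order
one_se_lambda <- function(cv_err, lambdas) {
  mean_err <- colMeans(cv_err)                           # mean CV error per lambda
  se_err   <- apply(cv_err, 2, sd) / sqrt(nrow(cv_err))  # SE across the V folds
  best     <- which.min(mean_err)                        # index of lambda.min
  cutoff   <- mean_err[best] + se_err[best]              # "no more than one SE above"
  # Most parsimonious model within the cutoff: since lambdas are sorted
  # decreasing, that is the smallest qualifying index (largest lambda).
  lambdas[min(which(mean_err <= cutoff))]
}
```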
