Solved – Why does glmnet use the “naive” elastic net from the original Zou & Hastie paper

Tags: elastic net, glmnet, regression, regularization

The original elastic net paper, Zou & Hastie (2005) Regularization and variable selection via the elastic net, introduced the elastic net loss function for linear regression (here I assume all variables are centered and scaled to unit variance): $$\mathcal L = \frac{1}{n}\big\lVert y - X\beta\big\rVert^2 + \lambda_1\lVert \beta\rVert_1 + \lambda_2 \lVert \beta\rVert^2_2,$$ but called it the "naive elastic net". They argued that it performs double shrinkage (lasso and ridge), tends to over-shrink, and can be improved by rescaling the resulting solution as follows: $$\hat\beta^* = (1+\lambda_2)\hat\beta.$$ They gave some theoretical arguments and experimental evidence that this leads to better performance.

However, the subsequent glmnet paper, Friedman, Hastie, & Tibshirani (2010) Regularization paths for generalized linear models via coordinate descent, did not use this rescaling and only had a brief footnote saying:

Zou and Hastie (2005) called this penalty the naive elastic net, and preferred a rescaled version which they called elastic net. We drop this distinction here.

No further explanation is given there (or in any of the Hastie et al. textbooks), which I find somewhat puzzling. Did the authors leave the rescaling out because they considered it too ad hoc? Because it performed worse in some further experiments? Because it was not clear how to generalize it to the GLM case? I have no idea. In any case, the glmnet package has become very popular since then, so my impression is that nowadays nobody uses the rescaling from Zou & Hastie, and most people are probably not even aware of this possibility.

Question: after all, was this rescaling a good idea or a bad idea?

In the glmnet parametrization, the Zou & Hastie rescaling would be $$\hat\beta^* = \big(1+\lambda(1-\alpha)\big)\hat\beta.$$
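For concreteness, a minimal R sketch of applying this rescaling on top of glmnet (the variable names and the particular `alpha`/`lambda` values are mine, purely for illustration; `x` is a standardized predictor matrix and `y` the response):

```r
library(glmnet)

alpha  <- 0.5    # elastic net mixing parameter (illustrative value)
lambda <- 0.1    # penalty strength (illustrative; normally chosen by cv.glmnet)

fit  <- glmnet(x, y, alpha = alpha)                # fit the regularization path
beta <- as.matrix(coef(fit, s = lambda))[-1, 1]    # coefficients at lambda, intercept dropped

## Zou & Hastie (2005) rescaling, expressed in the glmnet parametrization
beta_rescaled <- (1 + lambda * (1 - alpha)) * beta
```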

Best Answer

I emailed this question to Zou and to Hastie and got the following reply from Hastie (I hope he wouldn't mind me quoting it here):

I think in Zou et al we were worried about the additional bias, but of course rescaling increases the variance. So it just shifts one along the bias-variance tradeoff curve. We will soon be including a version of relaxed lasso which is a better form of rescaling.

I interpret these words as an endorsement of some form of "rescaling" of the vanilla elastic net solution, but Hastie no longer seems to stand by the particular approach put forward in Zou & Hastie (2005).


In the following I will briefly review and compare several rescaling options.

I will be using the glmnet parametrization of the loss $$\mathcal L = \frac{1}{2n}\big\lVert y - \beta_0 - X\beta\big\rVert^2 + \lambda\big(\alpha\lVert \beta\rVert_1 + (1-\alpha) \lVert \beta\rVert^2_2/2\big),$$ with the solution denoted by $\hat\beta$.
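(As a side note of my own, to connect the two parametrizations: multiplying the glmnet loss by $2$ and comparing it with the Zou & Hastie loss above gives $\lambda_1 = 2\lambda\alpha$ and $\lambda_2 = \lambda(1-\alpha)$, which is where the factor $1+\lambda(1-\alpha)$ used in point 1 below comes from.)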

  1. The approach of Zou & Hastie is to use $$\hat\beta_\text{rescaled} = \big(1+\lambda(1-\alpha)\big)\hat\beta.$$ Note that this yields a non-trivial rescaling for pure ridge when $\alpha=0$, which arguably does not make a lot of sense. On the other hand, it yields no rescaling for pure lasso when $\alpha=1$, despite various claims in the literature that the lasso estimator could benefit from some rescaling (see below).

  2. For pure lasso, Tibshirani suggested using a lasso-OLS hybrid, i.e. the OLS estimator on the subset of predictors selected by the lasso. This makes the estimator consistent (but undoes the shrinkage, which can increase the expected error). One can use the same approach for the elastic net, $$\hat\beta_\text{elastic-OLS-hybrid}= \text{OLS}(X_i\mid\hat\beta_i\ne 0),$$ but the potential problem is that the elastic net can select more than $n$ predictors, in which case OLS breaks down (in contrast, pure lasso never selects more than $n$ predictors).

  3. The relaxed lasso mentioned in Hastie's email quoted above is a suggestion to run another lasso on the subset of predictors selected by the first lasso. The idea is to use two different penalties and to select both via cross-validation. One could apply the same idea to the elastic net, but that would seem to require four different regularization parameters, and tuning them is a nightmare.

    I suggest a simpler relaxed elastic net scheme: after obtaining $\hat\beta$, perform ridge regression with $\alpha=0$ and the same $\lambda$ on the selected subset of predictors: $$\hat\beta_\text{relaxed-elastic-net}= \text{Ridge}(X_i\mid\hat\beta_i\ne 0).$$ This (a) does not require any additional regularization parameters, (b) works for any number of selected predictors, and (c) does not do anything if one starts with pure ridge. Sounds good to me; a code sketch follows right after this list.
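A minimal R sketch of this scheme, with my own naming (this is not the exact code behind the results below; `x` is assumed to be a standardized predictor matrix, `y` the response, and `alpha`/`lambda` already chosen, e.g. by `cv.glmnet`). Replacing the ridge refit with an OLS fit on the selected columns would give the elastic-OLS hybrid of point 2.

```r
library(glmnet)

## Relaxed elastic net as described above: fit the elastic net, keep the selected
## predictors, then refit a ridge regression (alpha = 0) with the same lambda on them.
relaxed_elastic_net <- function(x, y, alpha, lambda) {
  enet <- glmnet(x, y, alpha = alpha)                 # elastic net path
  beta <- as.matrix(coef(enet, s = lambda))[-1, 1]    # coefficients at lambda, intercept dropped
  sel  <- which(beta != 0)                            # indices of selected predictors
  if (length(sel) < 2)
    stop("fewer than two predictors selected; glmnet needs >= 2 columns, handle this case separately")
  ## a single lambda is passed directly for brevity; glmnet's documentation recommends fitting a path
  ridge <- glmnet(x[, sel, drop = FALSE], y, alpha = 0, lambda = lambda)
  list(selected = sel, fit = ridge)
}

## Illustrative usage, assuming cvfit <- cv.glmnet(x, y, alpha = 0.5):
##   res  <- relaxed_elastic_net(x, y, alpha = 0.5, lambda = cvfit$lambda.min)
##   yhat <- predict(res$fit, newx = x_test[, res$selected, drop = FALSE])
```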

I am currently working with a small $n\ll p$ dataset, with $n=44$ and $p=3000$, where $y$ is well predicted by the few leading PCs of $X$. I will compare the performance of the above estimators using 100x repeated 11-fold cross-validation. As a performance metric, I use the test error, normalized to yield something like an R-squared: $$R^2_\text{test} = 1-\frac{\lVert y_\text{test} - \hat\beta_0 - X_\text{test}\hat\beta\rVert^2}{\lVert y_\text{test} - \hat\beta_0\rVert^2}.$$ In the figure below, dashed lines correspond to the vanilla elastic net estimator $\hat\beta$, and the three subplots correspond to the three rescaling approaches:

[Figure: cross-validated $R^2_\text{test}$ for the three rescaling approaches, one subplot per approach; dashed lines show the vanilla elastic net $\hat\beta$]

So, at least in these data, all three approaches outperform the vanilla elastic net estimator, and "relaxed elastic net" performs the best.
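In case it is useful, the $R^2_\text{test}$ metric above corresponds to a small helper like the following (my own naming, not the exact evaluation code):

```r
## Test-set R^2 as defined above: 1 - ||y_test - beta0 - X_test beta||^2 / ||y_test - beta0||^2,
## with the intercept beta0 and the coefficients beta estimated on the training folds.
r2_test <- function(y_test, x_test, beta0, beta) {
  resid <- y_test - beta0 - as.vector(x_test %*% beta)
  1 - sum(resid^2) / sum((y_test - beta0)^2)
}
```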
