Solved – What is elastic net regularization, and how does it solve the drawbacks of Ridge ($L^2$) and Lasso ($L^1$)?

elastic net, lasso, regression, regularization, ridge regression

Is elastic net regularization always preferred to Lasso & Ridge since it seems to solve the drawbacks of these methods? What is the intuition and what is the math behind elastic net?

Best Answer

1. Which method is preferred?

Yes, elastic net is always preferred over lasso & ridge regression because it solves the limitations of both methods, while also including each as a special case. So if the ridge or lasso solution is, indeed, the best, then any good model selection routine will identify that as part of the modeling process.

Comments on my post have pointed out that the advantages of elastic net are not unqualified. I persist in my belief that the generality of elastic net regression is still preferable to either $L^1$ or $L^2$ regularization on its own. Specifically, I think that the points of contention between myself and others are directly tied to what assumptions we are willing to make about the modeling process. In the presence of strong knowledge about the underlying data, some methods will be preferred to others. However, my preference for elastic net is rooted in my skepticism that one will confidently know that $L^1$ or $L^2$ is the true model.

  1. Claim: Prior knowledge may obviate the need to use elastic net regression.

This is somewhat circular. Forgive me if this is glib, but if you know that LASSO (ridge) is the best solution, then you won't ask yourself how to appropriately model it; you'll just fit a LASSO (ridge) model. If you're absolutely sure that the correct answer is LASSO (ridge) regression, then you're clearly convinced that there would be no reason to waste time fitting an elastic net. But if you're slightly less certain whether LASSO (ridge) is the correct way to proceed, I believe it makes sense to estimate a more flexible model and evaluate how strongly the data support the prior belief.

  2. Claim: Modestly large data will not permit discovery of $L^1$ or $L^2$ solutions as preferred, even in cases when the $L^1$ or $L^2$ solution is the true model.

This is also true, but I think it's circular for a similar reason: if you've estimated an optimal solution and find that $\alpha\not\in \{0,1\},$ then that's the model that the data support. On the one hand, yes, your estimated model is not the true model, but I must wonder how one would know that the true model is $\alpha=1$ (or $\alpha=0$) prior to any model estimation. There might be domains where you have this kind of prior knowledge, but my professional work is not one of them.

  3. Claim: Introducing additional hyperparameters increases the computational cost of estimating the model.

This is only relevant if you have tight time or compute limitations; otherwise it's just a nuisance. GLMNET is the gold-standard algorithm for estimating elastic net solutions. The user supplies some value of $\alpha$, and it uses the path properties of the regularization solution to quickly estimate a family of models for a variety of values of the penalization magnitude $\lambda$; it can often estimate this family of solutions more quickly than estimating just one solution for a specific value of $\lambda$. So, yes, using GLMNET does consign you to the domain of grid-style methods (iterate over some values of $\alpha$ and let GLMNET try a variety of $\lambda$s), but it's pretty fast.
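As a rough illustration of this grid-over-$\alpha$, path-over-$\lambda$ workflow, here is a minimal sketch using scikit-learn's ElasticNetCV rather than GLMNET itself, on hypothetical simulated data (note that scikit-learn's l1_ratio=1 corresponds to the pure LASSO):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Hypothetical simulated data, just for illustration.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

# Grid over the mixing parameter (scikit-learn's l1_ratio weights the L1 term,
# so l1_ratio=1 is the LASSO and small l1_ratio approaches ridge). For each
# value in the grid, an entire path of penalty strengths is fit, much as
# GLMNET fits a family of lambda values at once.
model = ElasticNetCV(l1_ratio=[0.05, 0.2, 0.5, 0.8, 0.95, 1.0],
                     n_alphas=100, cv=5)
model.fit(X, y)

print("selected mixing parameter (l1_ratio):", model.l1_ratio_)
print("selected penalty strength:", model.alpha_)
print("number of nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```

The cross-validated choice of the mixing parameter is exactly the kind of grid-style search described above, and it is typically cheap relative to the rest of a modeling pipeline.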

  4. Claim: Improved performance of elastic net over LASSO or ridge regression is not guaranteed.

This is true, but at the step where one is contemplating which method to use, one will not know which of elastic net, ridge or LASSO is the best. If one reasons that the best solution must be LASSO or ridge regression, then we're in the domain of claim (1). If we're still uncertain which is best, then we can test LASSO, ridge and elastic net solutions, and make a choice of a final model at that point (or, if you're an academic, just write your paper about all three). In this situation of prior uncertainty, either we're in the domain of claim (2), where the true model is LASSO/ridge but we did not know so ahead of time and we accidentally select the wrong model due to poorly identified hyperparameters, or elastic net is actually the best solution.

  5. Claim: Hyperparameter selection without cross-validation is highly biased and error-prone.

Proper model validation is an integral part of any machine learning enterprise. Model validation is usually an expensive step, too, so one would seek to minimize inefficiencies here -- if one of those inefficiencies is needlessly trying $\alpha$ values that are known to be futile, then one suggestion might be to skip those values. Yes, by all means do that, if you're comfortable with the strong statement that you're making about how your data are arranged -- but we're back to the territory of claim (1) and claim (2).

2. What's the intuition and math behind elastic net?

I strongly suggest reading the literature on these methods, starting with the original paper on the elastic net. The paper develops the intuition and the math, and is highly readable. Reproducing it here would only be to the detriment of the authors' explanation. But the high-level summary is that the elastic net penalty is a convex combination of the ridge and LASSO penalties, so the objective function for a Gaussian error model looks like $$\text{Residual Mean Square Error}+\alpha \cdot \text{Ridge Penalty}+(1-\alpha)\cdot \text{LASSO Penalty}$$

for $\alpha\in[0,1].$
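To spell the penalties out (a sketch using the same convention as above, where $\alpha$ weights the ridge term), for a linear model with coefficients $\beta_1,\dots,\beta_p$ and overall penalty strength $\lambda$ the objective can be written as

$$\min_{\beta}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i-x_i^\top\beta\right)^2+\lambda\left[\alpha\sum_{j=1}^{p}\beta_j^2+(1-\alpha)\sum_{j=1}^{p}\lvert\beta_j\rvert\right],$$

so $\alpha=1$ recovers ridge regression and $\alpha=0$ recovers the LASSO. (Note that software such as GLMNET uses the opposite convention, with its $\alpha$ weighting the $L^1$ term.)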

Hui Zou and Trevor Hastie. "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society, Series B, vol. 67 (2005), Part 2, pp. 301-320.

Richard Hardy points out that this is developed in more detail in Hastie et al. "The Elements of Statistical Learning" chapters 3 and 18.

3. What if you add additional $L^q$ norms?

This is a question posed to me in the comments:

Let me suggest one further argument against your point of view that elastic net is uniformly better than lasso or ridge alone. Imagine that we add another penalty to the elastic net cost function, e.g. an $L^3$ cost, with a hyperparameter $\gamma$. I don't think there is much research on that, but I would bet you that if you do a cross-validation search on a 3d parameter grid, then you will get $\gamma\not =0$ as the optimal value. If so, would you then argue that it is always a good idea to include $L^3$ cost too.

I appreciate that the spirit of the question is "If it's as you claim and two penalties are good, why not add another?" But I think the answer lies in why we regularize in the first place.

$L^1$ regularization tends to produce sparse solutions, but also tends to select the feature most strongly correlated with the outcome and zero out the rest. Moreover, in a data set with $n$ observations, it can select at most $n$ features. $L^2$ regularization is suited to deal with ill-posed problems resulting from highly (or perfectly) correlated features. In a data set with $p$ features, $L^2$ regularization can be used to uniquely identify a model in the $p>n$ case.
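As a quick illustration of these two behaviors, here is a minimal sketch using scikit-learn on a hypothetical simulated $p>n$ data set with groups of nearly identical features:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical p > n data set built from groups of nearly identical features.
rng = np.random.default_rng(0)
n, p_groups, group_size = 50, 20, 10
latent = rng.normal(size=(n, p_groups))
X = np.repeat(latent, group_size, axis=1) + 0.01 * rng.normal(size=(n, p_groups * group_size))
beta_true = np.zeros(p_groups * group_size)
beta_true[:group_size] = 1.0            # only the first group of features matters
y = X @ beta_true + rng.normal(size=n)

# L1: sparse fit; it can keep at most n features and tends to pick a
# representative from each correlated group rather than the whole group.
lasso = Lasso(alpha=0.1, max_iter=50_000).fit(X, y)
print("LASSO nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))

# L2: no coefficients are set exactly to zero, but the solution is unique and
# stable even though p > n, with weight spread across the correlated copies.
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))
```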

Setting aside either of these problems, the regularized model can still out-perform the maximum-likelihood model, because the shrinkage properties of these estimators are "pessimistic" and pull coefficients toward 0.

But I am not aware of comparable statistical properties for $L^3$ regularization. In the problems I've worked on, we generally face both problems: the inclusion of features that are poorly correlated with the outcome (hypotheses that are not borne out by the data), and collinear features.

Indeed, there are compelling reasons that $L^1$ and $L^2$ penalties on parameters are the only ones typically used.

In Why do we only see $L_1$ and $L_2$ regularization but not other norms?, @whuber offers this comment:

I haven't investigated this question specifically, but experience with similar situations suggests there may be a nice qualitative answer: all norms that are second differentiable at the origin will be locally equivalent to each other, of which the $L^2$ norm is the standard. All other norms will not be differentiable at the origin and $L^1$ qualitatively reproduces their behavior. That covers the gamut. In effect, a linear combination of an $L^1$ and $L^2$ norm approximates any norm to second order at the origin--and this is what matters most in regression without outlying residuals.

So combinations of $L^1$ and $L^2$ norms can effectively cover the range of behavior that other $L^q$ norms could provide -- all without introducing an additional hyperparameter such as $\gamma$.