Is regression with L1 regularization the same as Lasso, and with L2 regularization the same as ridge regression? And how to write "Lasso"?

Tags: lasso, regression, regularization, ridge regression, terminology

I'm a software engineer learning machine learning, mainly through Andrew Ng's machine learning courses. While studying linear regression with regularization, I've come across some terms that confuse me:

  • Regression with L1 regularization or L2 regularization
  • LASSO
  • Ridge regression

So my questions:

  1. Is regression with L1 regularization exactly the same as LASSO?

  2. Is regression with L2 regularization exactly the same as Ridge Regression?

  3. How is "LASSO" used in writing? Should it be "LASSO regression"? I've seen usage like "the lasso is more appropriate".

If the answer to 1 and 2 above is "yes", then why are there two different names for each method? Do "L1" and "L2" come from computer science / math, and "LASSO" and "Ridge" from stats?

The way these terms are used confuses me when I see posts like:

"What is the difference between L1 and L2 regularization?" (quora.com)

"When should I use lasso vs ridge?" (stats.stackexchange.com)

Best Answer

  1. Yes.

  2. Yes. (A short sketch after point 3 makes both equivalences concrete.)

  3. LASSO is actually an acronym (least absolute shrinkage and selection operator), so it ought to be capitalized, but modern writing is the lexical equivalent of Mad Max. On the other hand, Amoeba writes that even the statisticians who coined the term LASSO now use the lower-case rendering (Hastie, Tibshirani and Wainwright, Statistical Learning with Sparsity). One can only speculate as to the motivation for the switch. If you're writing for an academic press, they typically have a style guide for this sort of thing. If you're writing on this forum, either is fine, and I doubt anyone really cares.
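To make answers 1 and 2 concrete, here is a minimal sketch of the two penalized objectives, written with numpy. The variable names and the exact scaling of the penalty weight `lam` are my own illustrative choices; conventions differ between textbooks and between libraries such as scikit-learn.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """OLS loss plus an L1 penalty on the coefficients.
    Minimizing this over beta is exactly the LASSO."""
    residual = y - X @ beta
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(beta))

def ridge_objective(X, y, beta, lam):
    """OLS loss plus a squared L2 penalty on the coefficients.
    Minimizing this over beta is exactly ridge regression."""
    residual = y - X @ beta
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(beta ** 2)
```

One detail worth noting: ridge conventionally penalizes the squared $L^2$ norm, and "L2 regularization" almost always refers to that squared form.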

The $L$ notation is a reference to Minkowski norms and $L^p$ spaces, which generalize the taxicab ($p=1$) and Euclidean ($p=2$) distances to any $p>0$: $$ \|x\|_p=\left(|x_1|^p+|x_2|^p+\cdots+|x_n|^p\right)^{\frac{1}{p}} $$ Importantly, only $p\ge 1$ defines a metric distance; for $0<p<1$ the triangle inequality fails, so it is not a distance by most definitions.
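As a quick numeric check of that formula, a small numpy sketch (the vector is an arbitrary example):

```python
import numpy as np

def lp_norm(x, p):
    """Minkowski norm from the formula above; a true norm only for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
print(lp_norm(x, 1))  # taxicab norm:  |3| + |-4| = 7.0  (the LASSO penalty)
print(lp_norm(x, 2))  # Euclidean norm: sqrt(9 + 16) = 5.0 (ridge penalizes its square)
```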

I'm not sure when the connection between ridge and the LASSO was first recognized.

As for why there are multiple names, it's simply that these methods were developed in different places at different times. A common theme in statistics is that a concept often has several names, one for each sub-field in which it was independently discovered (kernel functions vs. covariance functions, Gaussian process regression vs. Kriging, AUC vs. $c$-statistic). Ridge regression should probably be called Tikhonov regularization, since I believe Tikhonov has the earliest claim to the method. Meanwhile, the LASSO was only introduced in 1996, much later than Tikhonov's "ridge" method!