Is regression with L1 regularization the same as Lasso, and with L2 regularization the same as ridge regression? And how to write "Lasso"?

Tags: lasso, regression, regularization, ridge regression, terminology

I'm a software engineer learning machine learning, mainly through Andrew Ng's machine learning courses. While studying linear regression with regularization, I've come across some terms that confuse me:

  • Regression with L1 regularization or L2 regularization
  • LASSO
  • Ridge regression

So my questions:

  1. Is regression with L1 regularization exactly the same as LASSO?

  2. Is regression with L2 regularization exactly the same as Ridge Regression?

  3. How is "LASSO" used in writing? Should it be "LASSO regression"? I've seen usage like "the lasso is more appropriate".

If the answer to 1 and 2 above is "yes", then why are there two different names for each method? Do "L1" and "L2" come from computer science / math, and "LASSO" and "Ridge" from stats?

The way these terms are used confuses me when I see posts like:

"What is the difference between L1 and L2 regularization?" (quora.com)

"When should I use lasso vs ridge?" (stats.stackexchange.com)

Best Answer

  1. Yes.

  2. Yes. (A short sketch after point 3 makes both equivalences concrete.)

  3. LASSO is actually an acronym (least absolute shrinkage and selection operator), so it ought to be capitalized, but modern writing is the lexical equivalent of Mad Max. On the other hand, Amoeba writes that even the statisticians who coined the term LASSO now use the lower-case rendering (Hastie, Tibshirani and Wainwright, Statistical Learning with Sparsity). One can only speculate as to the motivation for the switch. If you're writing for an academic press, they typically have a style guide for this sort of thing. If you're writing on this forum, either is fine, and I doubt anyone really cares.
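To make answers 1 and 2 concrete, here is a minimal sketch of the two penalized objectives, written with numpy. The variable names and the exact scaling of the penalty weight `lam` are my own illustrative choices; conventions differ between textbooks and between libraries such as scikit-learn.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """OLS loss plus an L1 penalty on the coefficients.
    Minimizing this over beta is exactly the LASSO."""
    residual = y - X @ beta
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(beta))

def ridge_objective(X, y, beta, lam):
    """OLS loss plus a squared L2 penalty on the coefficients.
    Minimizing this over beta is exactly ridge regression."""
    residual = y - X @ beta
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(beta ** 2)
```

One detail worth noting: ridge conventionally penalizes the squared $L^2$ norm, and "L2 regularization" almost always refers to that squared form.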

The $L$ notation is a reference to Minkowski norms and $L^p$ spaces, which generalize the taxicab ($p=1$) and Euclidean ($p=2$) distances to any $p>0$: $$ \|x\|_p=\left(|x_1|^p+|x_2|^p+\cdots+|x_n|^p\right)^{\frac{1}{p}} $$ Importantly, only $p\ge 1$ defines a metric distance; for $0<p<1$ the triangle inequality fails, so it is not a distance by most definitions.
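As a quick numeric check of that formula, a small numpy sketch (the vector is an arbitrary example):

```python
import numpy as np

def lp_norm(x, p):
    """Minkowski norm from the formula above; a true norm only for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
print(lp_norm(x, 1))  # taxicab norm:  |3| + |-4| = 7.0  (the LASSO penalty)
print(lp_norm(x, 2))  # Euclidean norm: sqrt(9 + 16) = 5.0 (ridge penalizes its square)
```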

I'm not sure when the connection between ridge and the LASSO was first recognized.

As for why there are multiple names, it's simply that these methods were developed in different places at different times. A common theme in statistics is that a concept often has several names, one for each sub-field in which it was independently discovered (kernel functions vs. covariance functions, Gaussian process regression vs. Kriging, AUC vs. $c$-statistic). Ridge regression should probably be called Tikhonov regularization, since I believe Tikhonov has the earliest claim to the method. Meanwhile, the LASSO was only introduced in 1996, much later than Tikhonov's "ridge" method!