Say I want to estimate a large number of parameters, and I want to penalize some of them because I believe they should have little effect compared to the others. How do I decide what penalization scheme to use? When is ridge regression more appropriate? When should I use lasso?
Lasso vs Ridge – When to Use Each in Regression
lasso, regression, ridge regression
Related Solutions
If you rank 1 million ridge-shrunk, scaled, but non-zero coefficients, you still have to make some kind of decision: you could keep the n best predictors, but what should n be? The LASSO solves this problem in a principled, objective way, because at every step along the regularization path (and in practice you would usually settle on one point, e.g. via cross-validation), only some number m of coefficients are non-zero.
Very often, you will train a model on some data and then later apply it to data not yet collected. For example, you could fit your model on 50,000,000 emails and then use that model on every new email. True, you will fit it on the full feature set for the first 50,000,000 mails, but for every subsequent email you will deal with a much sparser, faster, and much more memory-efficient model. You also won't even need to collect the information for the dropped features, which may be hugely helpful if the features are expensive to extract, e.g. via genotyping.
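To make the sparsity point concrete, here is a minimal sketch (scikit-learn on synthetic data; the email corpus above is just an illustration): cross-validation picks a point on the lasso path, only the surviving features are needed for future predictions, while ridge leaves every coefficient non-zero.

```python
# Minimal sketch: pick a point on the lasso path by cross-validation and
# count how many features survive. Synthetic data stands in for the
# "many features, few relevant" setting described above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, Ridge

X, y = make_regression(n_samples=500, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)        # cross-validation chooses the penalty
kept = np.flatnonzero(lasso.coef_)     # indices of the non-zero coefficients
print(f"lasso keeps {kept.size} of {X.shape[1]} features")

ridge = Ridge(alpha=1.0).fit(X, y)     # ridge shrinks but (almost surely) never zeros out
print(f"ridge has {np.sum(ridge.coef_ != 0)} non-zero coefficients")

# For future data you only need the surviving columns:
X_new = np.random.default_rng(1).normal(size=(3, X.shape[1]))
predictions = X_new[:, kept] @ lasso.coef_[kept] + lasso.intercept_
```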
Another perspective on the L1/L2 problem, put forward by e.g. Andrew Gelman, is that you often have some intuition about what your problem may be like. In some circumstances, it is possible that reality is truly sparse. Maybe you have measured millions of genes, but it is plausible that only 30,000 of them actually determine dopamine metabolism. In such a situation, L1 arguably fits the problem better.
In other cases, reality may be dense. For example, in psychology, "everything correlates (to some degree) with everything" (Paul Meehl). Preferences for apples vs. oranges probably do correlate with political leanings somehow, and even with IQ. Regularization might still make sense here, but truly zero effects should be rare, so L2 might be more appropriate.
Yes.
Yes.
LASSO is actually an acronym (least absolute shrinkage and selection operator), so it ought to be capitalized, but modern writing is the lexical equivalent of Mad Max. On the other hand, Amoeba writes that even the statisticians who coined the term LASSO now use the lower-case rendering (Hastie, Tibshirani and Wainwright, Statistical Learning with Sparsity). One can only speculate as to the motivation for the switch. If you're writing for an academic press, they typically have a style guide for this sort of thing. If you're writing on this forum, either is fine, and I doubt anyone really cares.
The $L$ notation is a reference to Minkowski norms and $L^p$ spaces. These generalize the notion of taxicab and Euclidean distances to any $p>0$ via the expression: $$ \|x\|_p=\left(|x_1|^p+|x_2|^p+\dots+|x_n|^p\right)^{\frac{1}{p}} $$ Importantly, only $p\ge 1$ defines a metric distance; for $0<p<1$ the triangle inequality fails, so it is not a distance by most definitions.
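A quick numerical check of that formula, assuming NumPy is available; the vector `x` and the values of `p` are arbitrary choices for illustration:

```python
# Compute the Minkowski norm for a few values of p, both by hand from the
# formula above and with numpy.linalg.norm for comparison.
import numpy as np

x = np.array([3.0, -4.0, 1.0])

for p in (1, 2, 3):
    by_hand = np.sum(np.abs(x) ** p) ** (1.0 / p)
    via_numpy = np.linalg.norm(x, ord=p)
    print(f"p={p}: {by_hand:.4f}  (numpy: {via_numpy:.4f})")

# p = 0.5 still yields a number from the same formula, but the resulting
# "norm" violates the triangle inequality, so it is not a metric distance.
p = 0.5
print(np.sum(np.abs(x) ** p) ** (1.0 / p))
```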
I'm not sure when the connection between ridge and LASSO was realized.
As for why there are multiple names, it is simply that these methods were developed in different places at different times. A common theme in statistics is that concepts often have multiple names, one for each sub-field in which they were independently discovered (kernel functions vs covariance functions, Gaussian process regression vs Kriging, AUC vs $c$-statistic). Ridge regression should probably be called Tikhonov regularization, since I believe he has the earliest claim to the method. Meanwhile, LASSO was only introduced in 1996, much later than Tikhonov's "ridge" method!
Best Answer
Keep in mind that ridge regression can't zero out coefficients; thus, you either end up including all the coefficients in the model, or none of them. In contrast, the LASSO does both parameter shrinkage and variable selection automatically. If some of your covariates are highly correlated, you may want to look at the Elastic Net [3] instead of the LASSO.
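As an illustration of that last point, here is a small sketch (synthetic data, scikit-learn; the correlation level and penalty values are arbitrary) comparing how the LASSO and the Elastic Net distribute weight across two nearly identical covariates:

```python
# Hypothetical illustration of the correlated-covariates point above:
# duplicate (nearly) a relevant feature and compare how lasso and the
# elastic net spread weight across the correlated pair.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # almost perfectly correlated with x1
x3 = rng.normal(size=n)               # irrelevant feature
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

print(Lasso(alpha=0.1).fit(X, y).coef_)                      # often loads mostly on one of x1/x2
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # tends to spread weight more evenly
```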
I'd personally recommend using the Non-Negative Garrote (NNG) [1], as it is consistent in terms of both estimation and variable selection [2]. Unlike LASSO and ridge regression, the NNG requires an initial estimate that is then shrunk towards the origin. In the original paper, Breiman recommends the least-squares solution for the initial estimate (you may, however, want to start the search from a ridge regression solution and use something like GCV to select the penalty parameter).
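If you want to see the mechanics without the MATLAB code linked below, here is a rough Python sketch of the idea, not Breiman's implementation: it uses the standard rewriting of the garrote as a non-negative lasso on a design whose columns are rescaled by an initial OLS estimate. The function name `nn_garrote`, the default `alpha`, and the use of scikit-learn are all my own illustrative choices.

```python
# Rough sketch of the non-negative garrote in penalized form:
# find shrinkage factors c_j >= 0 by a non-negative lasso on a rescaled
# design, then return beta_j = c_j * beta_ols_j.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

def nn_garrote(X, y, alpha=0.1):
    """Garrote-shrunk coefficients using an OLS initial estimate (a ridge fit would also work)."""
    beta_init = LinearRegression().fit(X, y).coef_    # initial (least-squares) estimate
    Z = X * beta_init                                 # rescale each column by beta_ols_j
    # Non-negative lasso on Z gives the garrote shrinkage factors c_j >= 0.
    c = Lasso(alpha=alpha, positive=True).fit(Z, y).coef_
    return c * beta_init

# Example on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(nn_garrote(X, y, alpha=0.05))
```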
In terms of available software, I've implemented the original NNG in MATLAB (based on Breiman's original FORTRAN code). You can download it from:
http://www.emakalic.org/blog/wp-content/uploads/2010/04/nngarotte.zip
BTW, if you prefer a Bayesian solution, check out [4,5].
References:
[1] Breiman, L. (1995). Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37, 373-384.
[2] Yuan, M. & Lin, Y. (2007). On the non-negative garrotte estimator. Journal of the Royal Statistical Society (Series B), 69, 143-161.
[3] Zou, H. & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society (Series B), 67, 301-320.
[4] Park, T. & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103, 681-686.
[5] Kyung, M., Gill, J., Ghosh, M. & Casella, G. (2010). Penalized Regression, Standard Errors, and Bayesian Lassos. Bayesian Analysis, 5, 369-412.