Solved – If only prediction is of interest, why use lasso over ridge

lasso, machine-learning, prediction, regularization, ridge-regression

On page 223 of An Introduction to Statistical Learning, the authors summarise the differences between ridge regression and the lasso. They provide an example (Figure 6.9) of when "lasso tends to outperform ridge regression in terms of bias, variance, and MSE".
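For reference, the two estimators differ only in the penalty term (these are the standard definitions; the notation may differ slightly from the book's):

$$\hat\beta^{\text{ridge}} = \arg\min_\beta \sum_{i=1}^n \big(y_i - x_i^\top\beta\big)^2 + \lambda \sum_{j=1}^p \beta_j^2, \qquad \hat\beta^{\text{lasso}} = \arg\min_\beta \sum_{i=1}^n \big(y_i - x_i^\top\beta\big)^2 + \lambda \sum_{j=1}^p |\beta_j|.$$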

I understand why the lasso can be desirable: it sets many coefficients exactly to 0, resulting in sparse, simple, and interpretable models. But I do not understand how it can outperform ridge when only predictions are of interest (i.e. how is it getting a substantially lower MSE in the example?).

With ridge, if many predictors have almost no effect on the response (with a few predictors having a large effect), won't their coefficients simply be shrunk very close to zero, resulting in something very similar to the lasso? So why would the final model perform worse than the lasso?
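To make my intuition concrete, here is a minimal sketch using scikit-learn (the penalty strengths are arbitrary choices, not tuned) on data where only a few predictors truly matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))

# Sparse truth: only the first 5 predictors have any effect.
beta = np.zeros(p)
beta[:5] = 3.0
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso coefficients exactly 0:", np.sum(lasso.coef_ == 0))
print("ridge coefficients exactly 0:", np.sum(ridge.coef_ == 0))
print("ridge coefficients with |b| < 0.05:", np.sum(np.abs(ridge.coef_) < 0.05))
```

On data like this, the lasso typically zeroes out many of the 45 irrelevant coefficients, while ridge leaves all 50 nonzero, just small, which is exactly why I'd expect the two fits to make very similar predictions.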

Best Answer

You are right to ask this question. In general, when accuracy is measured with a proper scoring rule (e.g., mean squared prediction error), ridge regression will outperform the lasso. The lasso spends some of the information in the data trying to find the "right" predictors, and in many cases it is not even very good at that. The relative performance of the two depends on the distribution of the true regression coefficients: if only a small fraction of them are truly nonzero, the lasso can perform better, as the rough simulation below illustrates. Personally, I use ridge almost all the time when predictive accuracy is what matters.
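Here is a rough simulation sketch of that dependence (my own illustration, not from the book; the dimensions, noise level, and penalty grids are arbitrary choices), using scikit-learn's cross-validated estimators to compare test MSE under a sparse and a dense truth:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n_train, n_test, p = 100, 1000, 50

def test_mse(beta):
    """Fit cross-validated lasso and ridge on one draw; return test MSEs."""
    X = rng.normal(size=(n_train, p))
    y = X @ beta + rng.normal(size=n_train)
    X_new = rng.normal(size=(n_test, p))
    y_new = X_new @ beta + rng.normal(size=n_test)
    results = {}
    for name, model in [("lasso", LassoCV(cv=5)),
                        ("ridge", RidgeCV(alphas=np.logspace(-3, 3, 50)))]:
        model.fit(X, y)
        results[name] = mean_squared_error(y_new, model.predict(X_new))
    return results

# Sparse truth: a few large coefficients, the rest exactly zero.
beta_sparse = np.zeros(p)
beta_sparse[:5] = 3.0

# Dense truth: every coefficient small but nonzero.
beta_dense = rng.normal(scale=0.3, size=p)

print("sparse truth:", test_mse(beta_sparse))   # lasso tends to win here
print("dense truth: ", test_mse(beta_dense))    # ridge tends to win here
```

The point is not the exact numbers, which vary from run to run, but the pattern: with a sparse truth the lasso's variable selection pays off, and with many small nonzero coefficients ridge's uniform shrinkage does.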