Solved – Mean squared error (MSE) prediction performance: Ridge vs Lasso

Tags: lasso, prediction, ridge regression

According to the answer in the post linked below, ridge will outperform lasso in prediction performance when the prediction metric is mean squared error (MSE):

If only prediction is of interest, why use lasso over ridge?

But why is that? Is there a reasonable proof under certain assumptions, or are there counterexamples?

Update after answers from @EdM

Here is a 1995 paper, Better subset regression using the nonnegative garrote, whose fourth paragraph mentions that ridge regression gives more accurate predictions than subset regression unless all but a few of the coefficients in the linear regression are nearly zero and the rest are large.

Best Answer

As the answer from Frank Harrell on the page that you linked puts it:

Relative performance of the two will depend on the distribution of true regression coefficients. If you have a small fraction of nonzero coefficients in truth, lasso can perform better.

I don't know that there can be a general "proof" of when one or the other of ridge or LASSO will work better in terms of mean squared error (MSE). A reasonable way to approach this issue, however, is to consider the principle that you don't gain anything by throwing away information.

In textbook explanations of LASSO you will find scenarios where only a handful of uncorrelated predictors are truly related to outcome while the other predictors are random and unrelated to outcome. Applying that principle, when you truly do have a small fraction of non-zero coefficients, you are not throwing away any useful information by discarding the others. LASSO works well in such cases; ridge will necessarily keep some spurious coefficients whose values depend heavily on the sample at hand, leading to a model that might not generalize well out of sample.
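This sparse-truth scenario is easy to simulate. The sketch below (my own illustration, not from the linked answer; all sizes and seeds are arbitrary choices) generates more predictors than observations with only a handful of large true coefficients, then compares cross-validated lasso and ridge on held-out test MSE:

```python
# Hypothetical simulation of the textbook sparse-truth scenario:
# p = 200 uncorrelated predictors, only 5 truly related to outcome.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 100, 200
beta = np.zeros(p)
beta[:5] = 3.0                      # small fraction of large nonzero coefficients

X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)
X_test = rng.standard_normal((1000, p))
y_test = X_test @ beta + rng.standard_normal(1000)

lasso = LassoCV(cv=5).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)

mse_lasso = mean_squared_error(y_test, lasso.predict(X_test))
mse_ridge = mean_squared_error(y_test, ridge.predict(X_test))
print(f"lasso test MSE: {mse_lasso:.3f}, ridge test MSE: {mse_ridge:.3f}")
```

In runs of this kind, lasso's test MSE tends to sit close to the noise variance while ridge's is noticeably higher, since ridge must spread shrinkage across all 200 coefficients rather than discarding the 195 spurious ones.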

In many real-life situations, however, you have a set of predictors that are correlated with each other while many are associated with outcome to some extent. In that case, if you use LASSO you will only choose one or a few predictors from a set of correlated predictors, and thus you throw away information the discarded predictors might provide for out-of-sample applications of your model. Ridge regression, in contrast, keeps some information from all of the predictors.
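The selection behavior described above can also be seen directly. In this sketch (again my own illustration, with arbitrary penalty strengths), ten predictors share a common latent factor and are all weakly related to the outcome; lasso zeroes out most of the correlated set while ridge retains a nonzero coefficient for every predictor:

```python
# Hypothetical illustration: a block of highly correlated predictors,
# each carrying some information about the outcome.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 200, 10
z = rng.standard_normal(n)                           # shared latent factor
X = z[:, None] + 0.3 * rng.standard_normal((n, p))   # near-duplicate columns
y = 0.3 * X.sum(axis=1) + rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso nonzero coefficients:", np.count_nonzero(lasso.coef_))
print("ridge nonzero coefficients:", np.count_nonzero(ridge.coef_))
```

Which of the near-duplicate columns lasso happens to keep depends on the sample at hand, which is exactly the instability the answer describes; ridge instead shares the coefficient mass across the whole correlated block.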

Whether or not it provides better predictions in terms of MSE, ridge should at least help protect against the luck of the draw in situations where one of a set of correlated predictors has an unusually high relation to the outcome in the sample at hand versus the population as a whole, and is thus selected and overweighted by LASSO. So unless practical considerations with very large numbers of predictors make it unwieldy, or you know going into the study that only a few predictors are related to the outcome, ridge provides a reasonable general choice.