Solved – Why does shrinkage work

Tags: intuition, lasso, regularization, ridge regression

In order to solve problems of model selection, a number of methods (LASSO, ridge regression, etc.) will shrink the coefficients of predictor variables towards zero. I am looking for an intuitive explanation of why this improves predictive ability. If the true effect of the variable was actually very large, why doesn't shrinking the parameter result in a worse prediction?

Best Answer

Roughly speaking, there are three different sources of prediction error:

  1. the bias of your model
  2. the variance of your model
  3. unexplainable variance

We can't do anything about point 3 (except attempt to estimate the unexplainable variance and incorporate it into our predictive densities and prediction intervals). This leaves us with points 1 and 2.
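For squared-error loss, the standard decomposition makes these three pieces explicit (a quick sketch in generic notation: $f$ is the true regression function, $\hat{f}$ the fitted model, and $\sigma^2$ the variance of the unexplainable noise):

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{unexplainable}}
$$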

If you actually have the "right" model, then, say, OLS parameter estimates will be unbiased and have minimal variance among all unbiased (linear) estimators (they are BLUE). Predictions from an OLS model will be best linear unbiased predictions (BLUPs). That sounds good.

However, it turns out that although we have unbiased predictions and minimal variance among all unbiased predictions, the variance can still be pretty large. More importantly, we can sometimes introduce "a little" bias and simultaneously save "a lot" of variance - and by getting the tradeoff just right, we can get a lower prediction error with a biased (lower variance) model than with an unbiased (higher variance) one. This is called the "bias-variance tradeoff", and this question and its answers are enlightening: When is a biased estimator preferable to an unbiased one?
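To see why a little bias can be worth it, here is the simplest toy calculation I know of (a generic illustration, not tied to any particular regression model): take an unbiased estimator $\hat{\theta}$ of a single parameter $\theta$, and shrink it towards zero by a factor $c \in [0, 1]$. Then

$$
\operatorname{MSE}(c\hat{\theta}) = \operatorname{Var}(c\hat{\theta}) + \operatorname{Bias}(c\hat{\theta})^2 = c^2 \operatorname{Var}(\hat{\theta}) + (1 - c)^2 \theta^2,
$$

which is minimized at $c^{*} = \theta^2 / \big(\theta^2 + \operatorname{Var}(\hat{\theta})\big) < 1$ whenever $\operatorname{Var}(\hat{\theta}) > 0$. So some amount of shrinkage always reduces the MSE, even though the shrunk estimator is biased. The catch is that the optimal amount depends on the unknown $\theta$, which is why in practice the regularization strength is chosen by cross-validation or similar.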

And regularization methods like the lasso, ridge regression, the elastic net and so forth do exactly that. They pull the model towards zero. (Bayesian approaches are similar - they pull the model towards the priors.) Thus, regularized models will be biased compared to non-regularized models, but they will also have lower variance. If you choose the amount of regularization right, the result is a lower prediction error.
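As a rough numerical illustration of this tradeoff (a minimal simulation sketch, assuming NumPy and scikit-learn are available; the data-generating process, sample sizes and the ridge penalty are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# A setting where OLS has high variance: many correlated predictors,
# few observations, modest true effects, noisy responses.
n, p = 50, 30
true_beta = rng.normal(scale=0.5, size=p)

def simulate(n_obs):
    # Pairwise correlation 0.7 between predictors makes OLS unstable.
    corr = 0.7 * np.ones((p, p)) + 0.3 * np.eye(p)
    X = rng.normal(size=(n_obs, p)) @ np.linalg.cholesky(corr).T
    y = X @ true_beta + rng.normal(scale=2.0, size=n_obs)
    return X, y

X_test, y_test = simulate(2000)           # large test set for stable error estimates
ols_err, ridge_err = [], []
for _ in range(200):                      # repeat over many training samples
    X, y = simulate(n)
    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)   # penalty picked by hand for illustration
    ols_err.append(np.mean((y_test - ols.predict(X_test)) ** 2))
    ridge_err.append(np.mean((y_test - ridge.predict(X_test)) ** 2))

print(f"mean test MSE, OLS:   {np.mean(ols_err):.2f}")
print(f"mean test MSE, ridge: {np.mean(ridge_err):.2f}")   # typically lower here
```

The biased ridge fit usually predicts better than the unbiased OLS fit in this kind of setup; in practice the penalty would of course be chosen by cross-validation rather than fixed by hand.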

If you search for "bias-variance tradeoff regularization" or similar, you get some food for thought. This presentation, for instance, is useful.

EDIT: amoeba quite rightly points out that I am handwaving as to why exactly regularization yields lower variance of models and predictions. Consider a lasso model with a large regularization parameter $\lambda$. If $\lambda\to\infty$, your lasso parameter estimates will all be shrunk to zero. A fixed parameter value of zero has zero variance. (This is not entirely correct, since the threshold value of $\lambda$ beyond which your parameters will be shrunk to zero depends on your data and your model. But given the model and the data, you can find a $\lambda$ such that the model is the zero model. Always keep your quantifiers straight.) However, the zero model will of course also have a giant bias. It doesn't care about the actual observations, after all.

And the same applies to not-all-that-extreme values of your regularization parameter(s): small values will yield parameter estimates close to the unregularized ones, which will be less biased (unbiased if you have the "correct" model) but have higher variance. They will "jump around", following your actual observations. Higher values of your regularization parameter $\lambda$ will "constrain" your parameter estimates more and more. This is why the methods have names like "lasso" or "elastic net": they constrain the freedom of your parameters to float around and follow the data.
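To make the "constraining" concrete, here is a small sketch in the same spirit (again with simulated data and a hand-picked grid of penalties) showing how the lasso pulls coefficients towards zero, and eventually sets them exactly to zero, as the regularization strength grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Three genuinely non-zero coefficients, seven pure-noise predictors.
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta + rng.normal(size=n)

# scikit-learn's alpha plays the role of lambda in the text above.
for alpha in [0.01, 0.1, 0.5, 1.0, 5.0]:
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    n_zero = int(np.sum(coefs == 0))
    print(f"alpha={alpha:<5} zero coefficients: {n_zero:2d}   "
          f"largest |coef|: {np.max(np.abs(coefs)):.2f}")

# As alpha grows, more coefficients are set to exactly zero and the rest are
# shrunk; for large enough alpha the fit collapses to the zero model above.
```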

(I am writing up a little paper on this, which will hopefully be rather accessible. I'll add a link once it's available.)