Regularization – What Problem Do Shrinkage Methods Solve?

Tags: lars, lasso, regularization, ridge regression

The holiday season has given me the opportunity to curl up next to the fire with The Elements of Statistical Learning. Coming from a (frequentist) econometrics perspective, I'm having trouble grasping the uses of shrinkage methods like ridge regression, lasso, and least angle regression (LAR). Typically, I'm interested in the parameter estimates themselves and in achieving unbiasedness or at least consistency. Shrinkage methods don't do that.

It seems to me that these methods are used when the statistician worries that the regression function is too responsive to the predictors, treating them as more important (as measured by the magnitude of their coefficients) than they actually are. In other words, overfitting.

But OLS typically provides unbiased and consistent estimates (see footnote). I've always viewed the problem of overfitting not as giving estimates that are too big, but as giving confidence intervals that are too small because the selection process isn't taken into account (ESL mentions this latter point).

Unbiased/consistent coefficient estimates lead to unbiased/consistent predictions of the outcome. Shrinkage methods push predictions closer to the mean outcome than OLS would, seemingly leaving information on the table.

To reiterate, I don't see what problem the shrinkage methods are trying to solve. Am I missing something?

Footnote: We need the full column rank condition for identification of the coefficients. The exogeneity/zero conditional mean assumption for the errors and the linear conditional expectation assumption determine the interpretation that we can give to the coefficients, but we get an unbiased or consistent estimate of something even if these assumptions aren't true.

Best Answer

I suspect you want a deeper answer, and I'll have to let someone else provide that, but I can give you some thoughts on ridge regression from a loose, conceptual perspective.

OLS regression yields parameter estimates that are unbiased (i.e., if such samples were gathered and the parameters estimated indefinitely, the sampling distribution of the estimates would be centered on the true values). Moreover, by the Gauss-Markov theorem, that sampling distribution has the lowest variance of any linear unbiased estimator (meaning that, on average, an OLS estimate is closer to the true value than an estimate from any other linear unbiased procedure). This is old news (and I apologize, I know you know this well). However, the fact that the variance is the lowest available does not mean that it is low in absolute terms. Under some circumstances, the variance of the sampling distribution can be so large as to make the OLS estimator essentially worthless. (One situation where this occurs is a high degree of multicollinearity among the predictors.)
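To make the multicollinearity point concrete, here is a tiny, completely made-up simulation sketch (my own toy numbers, nothing from ESL): two standardized predictors with true coefficients of 1, and OLS refit on thousands of fresh samples. The sample size, correlations, and replication count below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 2000                      # sample size and number of simulated samples
beta = np.array([1.0, 1.0])             # true coefficients

def sd_of_ols_estimate(rho):
    """Std. dev. of the OLS estimate of the first coefficient across
    simulated samples whose two predictors have correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = []
    for _ in range(reps):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        y = X @ beta + rng.normal(size=n)
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][0])
    return np.std(estimates)

print("sd of estimate, rho = 0.00:", round(sd_of_ols_estimate(0.00), 3))  # roughly 0.14
print("sd of estimate, rho = 0.99:", round(sd_of_ols_estimate(0.99), 3))  # roughly 1.0
# OLS is unbiased in both cases, but with rho = 0.99 a single estimate of a
# coefficient whose true value is 1 routinely lands anywhere from about -1 to 3.
```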

What is one to do in such a situation? Well, a different estimator could be found that has lower variance (although, given what was stipulated above, it must then be biased). That is, we trade unbiasedness for lower variance: we accept parameter estimates that are shrunk somewhat toward zero, but that are likely to be substantially closer to the true value. Whether this tradeoff is worthwhile is a judgment the analyst must make when confronted with such a situation. At any rate, ridge regression is just such a technique. The following (completely fabricated) figure, and the short sketch after it, are intended to illustrate these ideas.

[Figure: a fabricated illustration of the bias/variance tradeoff between the OLS and ridge estimators.]
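In symbols, the tradeoff being illustrated is the usual decomposition of an estimator's mean squared error into squared bias plus variance (a standard identity, written here for a single coefficient):

$$
\operatorname{MSE}(\hat{\beta}) \;=\; \mathbb{E}\big[(\hat{\beta}-\beta)^2\big] \;=\; \operatorname{Bias}(\hat{\beta})^2 + \operatorname{Var}(\hat{\beta}),
$$

so a biased estimator wins on MSE whenever the variance it removes exceeds the squared bias it introduces. Continuing the made-up numbers from the sketch above, here is the same comparison run for OLS versus ridge; the correlation of 0.99 and the ridge penalty of 5 are arbitrary choices (in practice the penalty would be tuned, e.g. by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 2000
beta = np.array([1.0, 1.0])                  # true coefficients
rho, lam = 0.99, 5.0                         # collinearity and ridge penalty (hand-picked)
cov = np.array([[1.0, rho], [rho, 1.0]])

ols_est, ridge_est = [], []
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = X @ beta + rng.normal(size=n)
    ols_est.append(np.linalg.lstsq(X, y, rcond=None)[0])
    # Ridge solution: (X'X + lam * I)^{-1} X'y
    ridge_est.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

for name, est in [("OLS  ", np.array(ols_est)), ("ridge", np.array(ridge_est))]:
    print(f"{name} mean estimate: {est.mean(axis=0).round(2)}, "
          f"MSE: {((est - beta) ** 2).mean():.3f}")
# Typical output: OLS is centered on (1, 1) with an MSE near 1, while ridge is
# centered near (0.95, 0.95), i.e. pulled toward zero, with an MSE dozens of
# times smaller: on average its estimates land much closer to the truth.
```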

This provides a short, simple, conceptual introduction to ridge regression. I know less about the lasso and LAR, but I believe the same ideas could be applied. More information about the lasso and least angle regression can be found here; the "simple explanation..." link is especially helpful. This provides much more information about shrinkage methods.

I hope this is of some value.