Solved – Why does the ridge estimate become better than OLS by adding a constant to the diagonal?

least-squares, regression, regularization, ridge-regression

I understand that the ridge regression estimate is the $\beta$ that minimizes the residual sum of squares plus a penalty on the size of $\beta$:

$$\beta_\mathrm{ridge} = (\lambda I_D + X'X)^{-1}X'y = \operatorname{argmin}\big[ \text{RSS} + \lambda \|\beta\|^2_2\big]$$

However, I don't fully understand the significance of the fact that $\beta_\text{ridge}$ differs from $\beta_\text{OLS}$ only by the addition of a small constant to the diagonal of $X'X$. Indeed,

$$\beta_\text{OLS} = (X'X)^{-1}X'y$$

  1. My book mentions that this makes the estimate more stable numerically — why?

  2. Is the numerical stability related to the shrinkage of the ridge estimate towards 0, or is it just a coincidence?
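
For concreteness, here is a minimal numpy/scipy sketch (the data, the value of $\lambda$, and the variable names are all made up for illustration) that computes both closed forms above and checks numerically that the ridge formula matches the penalized argmin:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Made-up data: n observations, D predictors (all names and values illustrative)
n, D = 50, 3
X = rng.normal(size=(n, D))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0  # an arbitrary ridge penalty

# Closed-form OLS and ridge estimates
beta_ols   = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Direct numerical minimization of RSS + lambda * ||beta||_2^2
rss_plus_penalty = lambda b: np.sum((y - X @ b) ** 2) + lam * np.sum(b ** 2)
beta_argmin = minimize(rss_plus_penalty, np.zeros(D)).x

print("OLS:          ", beta_ols)
print("ridge formula:", beta_ridge)
print("ridge argmin: ", beta_argmin)  # agrees with the formula up to solver tolerance
```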

Best Answer

In an unpenalized regression, you can often get a ridge* in parameter space, where many different values along the ridge all do as well or nearly as well on the least squares criterion.

* (at least, it's a ridge in the likelihood function -- they're actually valleys in the RSS criterion, but I'll continue to call it a ridge, as this seems to be conventional -- or even, as Alexis points out in comments, I could call that a thalweg, being the valley's counterpart of a ridge)

In the presence of a ridge in the least squares criterion in parameter space, the penalty you get with ridge regression gets rid of those ridges by pushing the criterion up as the parameters head away from the origin:

[Figure: contour plots in parameter space of the unpenalized least-squares criterion (with a long, nearly flat ridge) and of the ridge-penalized criterion.]

In the first plot, a large change in parameter values (along the ridge) produces a minuscule change in the RSS criterion. This can cause numerical instability: the solution is very sensitive to small changes (e.g. a tiny change in a data value, or even truncation or rounding error). The parameter estimates are almost perfectly correlated, and you may get parameter estimates that are very large in magnitude.
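
To make this concrete, here is a minimal numpy sketch (with made-up, nearly collinear data and an arbitrary $\lambda$) in which a tiny perturbation of the response swings the OLS estimate along the ridge, while the penalized estimate barely moves:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Two almost perfectly collinear predictors: the RSS surface has a long,
# nearly flat valley along the direction where beta1 + beta2 stays constant.
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2                      # exact response for clarity; true beta = (1, 1)

lam = 1.0                        # arbitrary penalty for illustration

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# A tiny perturbation of the response (think truncation or rounding error)
y_pert = y + 1e-4 * rng.normal(size=n)

print("OLS,   original y: ", ols(X, y))             # close to (1, 1)
print("OLS,   perturbed y:", ols(X, y_pert))        # swings to large, opposite-signed values
print("ridge, original y: ", ridge(X, y, lam))
print("ridge, perturbed y:", ridge(X, y_pert, lam)) # barely moves
```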

By contrast, because the $L_2$ penalty lifts up the criterion that ridge regression minimizes wherever the parameters are far from 0, small changes in conditions (such as a little rounding or truncation error) can't produce gigantic changes in the resulting estimates. The penalty does shrink the estimates toward 0 (introducing some bias), but a small amount of bias can buy a substantial improvement in the variance (by eliminating that ridge).
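
A small simulation sketch along the same lines (again with made-up, nearly collinear data and an arbitrary penalty value) illustrates that trade: across repeated samples, ridge is slightly biased toward 0 but has far smaller variance, and hence far smaller mean squared error, than OLS:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000
lam = 1.0                                  # arbitrary penalty for illustration
beta_true = np.array([1.0, 1.0])

# Fixed, nearly collinear design (same flavour of made-up data as above)
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
I2 = np.eye(2)

ols_est, ridge_est = [], []
for _ in range(reps):
    y = X @ beta_true + rng.normal(size=n)
    ols_est.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_est.append(np.linalg.solve(X.T @ X + lam * I2, X.T @ y))
ols_est, ridge_est = np.array(ols_est), np.array(ridge_est)

for name, est in [("OLS", ols_est), ("ridge", ridge_est)]:
    bias2 = np.sum((est.mean(axis=0) - beta_true) ** 2)
    var = np.sum(est.var(axis=0))
    print(f"{name:5s} bias^2 = {bias2:9.4f}  variance = {var:10.2f}  MSE = {bias2 + var:10.2f}")
# Ridge is pulled slightly toward 0 (a little bias), but its variance, and
# hence its MSE, is orders of magnitude smaller than OLS's on this design.
# (OLS is unbiased; any nonzero OLS bias^2 printed here is Monte Carlo noise.)
```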

The uncertainty of the estimates is reduced (the standard errors are inversely related to the second derivative, which the penalty makes larger).

Correlation between the parameter estimates is reduced, and you will no longer get parameter estimates that are very large in magnitude when the RSS for small parameters is not much worse.
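
One simple way to see both of these effects is to look at the second-derivative (Hessian) matrix of the penalized criterion, $2(X'X + \lambda I)$, and use its inverse as a rough curvature-based proxy for the covariance of the estimates. The sketch below does that on made-up collinear data with an arbitrary grid of $\lambda$ values (this proxy is not the exact sandwich covariance of the ridge estimator):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Made-up, nearly collinear design, as in the sketches above
x1 = rng.normal(size=n)
x2 = x1 + 1e-2 * rng.normal(size=n)
X = np.column_stack([x1, x2])

def curvature_summary(lam):
    # The Hessian of RSS + lam*||beta||^2 is 2*(X'X + lam*I); its inverse is a
    # rough curvature-based proxy (up to sigma^2) for the covariance of the estimates.
    H_inv = np.linalg.inv(X.T @ X + lam * np.eye(2))
    se_like = np.sqrt(np.diag(H_inv))
    corr = H_inv[0, 1] / (se_like[0] * se_like[1])
    return se_like, corr

for lam in [0.0, 1.0, 10.0, 100.0]:
    se_like, corr = curvature_summary(lam)
    print(f"lambda = {lam:6.1f}   SE-like: {se_like.round(4)}   correlation: {corr:+.4f}")
# As lambda grows, the second derivative grows, the SE-like quantities shrink
# sharply, and the near-perfect negative correlation between the two estimates weakens.
```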