In an unpenalized regression, you can often get a ridge* in parameter space, where many different values along the ridge all do as well or nearly as well on the least squares criterion.

* (at least, it's a ridge in the *likelihood function* -- they're actually *valleys* in the RSS criterion, but I'll continue to call it a ridge, as this seems to be conventional -- or even, as Alexis points out in comments, I could call that a *thalweg*, being the valley's counterpart of a ridge)

In the presence of a ridge in the least squares criterion in parameter space, the penalty you get with ridge regression gets rid of those ridges by pushing the criterion up as the parameters head away from the origin:

[Clearer image]

In the first plot, a large change in parameter values (along the ridge) produces a minuscule change in the RSS criterion. This causes numerical instability: the fit is very sensitive to small changes (e.g. a tiny change in a data value, even truncation or rounding error). The parameter estimates are almost perfectly correlated, and you may get parameter estimates that are very large in magnitude.

By contrast, by lifting up the thing that ridge regression minimizes (by adding the $L_2$ penalty) when the parameters are far from 0, small changes in conditions (such as a little rounding or truncation error) can't produce gigantic changes in the resulting estimates. The penalty term results in shrinkage toward 0 (resulting in some bias). A small amount of bias can buy a substantial improvement in the variance (by eliminating that ridge).

The uncertainty of the estimates is reduced (the standard errors are inversely related to the second derivative, which is made larger by the penalty).

Correlation in parameter estimates is reduced. You also no longer get parameter estimates that are very large in magnitude when small parameters would make the RSS only slightly worse.
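To make this concrete, here is a small numpy sketch (the simulated data and perturbation sizes are arbitrary choices for illustration): with two nearly collinear predictors, a rounding-sized perturbation of the response moves the OLS estimates dramatically along the valley, while the ridge estimates barely move.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two nearly collinear predictors create a long flat valley ("ridge") in the RSS surface.
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)   # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# A rounding-sized perturbation of y, aligned with the nearly degenerate direction.
d = x1 - x2
y_pert = y + 1e-4 * d / np.linalg.norm(d)

ols_change = np.abs(ols(X, y) - ols(X, y_pert)).max()
ridge_change = np.abs(ridge(X, y, 1.0) - ridge(X, y_pert, 1.0)).max()

print(ols_change, ridge_change)  # OLS moves by orders of magnitude more than ridge
```

The penalty lifts the flat valley, so the minimum of the penalized criterion is well separated and insensitive to the perturbation.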

Say we are optimizing a model with parameters $\vec{\theta}$ by minimizing some criterion $f(\vec{\theta})$, subject to a constraint on the magnitude of the parameter vector (for instance, to implement a structural risk minimization approach by constructing a nested set of models of increasing complexity). Then we need to solve:

$\min_{\vec{\theta}} f(\vec{\theta}) \quad \mathrm{s.t.} \quad \|\vec{\theta}\|^2 < C$

The Lagrangian for this problem is (caveat: I think, it's been a long day... ;-)

$\Lambda(\vec{\theta},\lambda) = f(\vec{\theta}) + \lambda\|\vec{\theta}\|^2 - \lambda C.$

So it can easily be seen that a regularized cost function is closely related to a constrained optimization problem, with the regularization parameter $\lambda$ being related to the constant governing the constraint ($C$); $\lambda$ is essentially the Lagrange multiplier. The $-\lambda C$ term is just an additive constant, so omitting it doesn't change the solution of the optimization problem, just the value of the objective function.

This illustrates why e.g. ridge regression implements structural risk minimization: Regularization is equivalent to putting a constraint on the magnitude of the weight vector and if $C_1 > C_2$ then every model that can be made while obeying the constraint that

$\|\vec{\theta}\|^2 < C_2$

will also be available under the constraint

$\|\vec{\theta}\|^2 < C_1$.

Hence reducing $\lambda$ generates a sequence of hypothesis spaces of increasing complexity.
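This nesting is easy to see numerically. In the sketch below (arbitrary simulated data), decreasing $\lambda$ produces fitted coefficient vectors of strictly increasing squared norm, i.e. the solutions sweep through balls of increasing radius $C$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 4.0]) + rng.normal(size=100)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Smaller lambda corresponds to a larger constraint radius C: the squared norm
# of the fitted coefficient vector grows as lambda decreases.
lams = [100.0, 10.0, 1.0, 0.1]
norms = [np.sum(ridge(lam) ** 2) for lam in lams]
print(norms)
```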

## Best Answer

There are two formulations for the ridge problem. The first one is

$$\boldsymbol{\beta}_R = \operatorname*{argmin}_{\boldsymbol{\beta}} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\prime} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)$$

subject to

$$\sum_{j} \beta_j^2 \leq s. $$

This formulation makes the size constraint on the regression coefficients explicit. Note what this constraint implies: we are forcing the coefficients to lie in a ball around the origin with radius $\sqrt{s}$.

The second formulation is exactly your problem

$$\boldsymbol{\beta}_R = \operatorname*{argmin}_{\boldsymbol{\beta}} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\prime} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right) + \lambda \sum\beta_j^2 $$

which may be viewed as the Lagrange multiplier formulation. Note that here $\lambda$ is a tuning parameter: larger values lead to greater shrinkage. You may proceed to differentiate the expression with respect to $\boldsymbol{\beta}$ and obtain the well-known ridge estimator

$$\boldsymbol{\beta}_{R} = \left( \mathbf{X}^{\prime} \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^{\prime} \mathbf{y} \tag{1}$$
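As a sanity check (a small numpy sketch with arbitrary simulated data), the closed-form estimator in (1) does satisfy the first-order condition of the penalized criterion:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)
lam = 2.0

# Closed-form ridge estimator from equation (1).
beta_R = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Gradient of the penalized criterion (y - Xb)'(y - Xb) + lam * b'b,
# namely -2 X'(y - Xb) + 2 lam b; it should vanish at beta_R.
grad = -2 * X.T @ (y - X @ beta_R) + 2 * lam * beta_R
print(np.abs(grad).max())  # ~0 up to floating-point error
```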

The two formulations are completely equivalent, since there is a one-to-one correspondence between $s$ and $\lambda$.

Let me elaborate a bit on that. Imagine that you are in the ideal orthogonal case, $\mathbf{X}^{\prime} \mathbf{X} = \mathbf{I}$. This is a highly simplified and unrealistic situation, but it lets us investigate the estimator a little more closely, so bear with me. Consider what happens to equation (1). The ridge estimator reduces to

$$\boldsymbol{\beta}_R = \left( \mathbf{I} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^{\prime} \mathbf{y} = \left( \mathbf{I} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\beta}_{OLS} $$

as in the orthogonal case the OLS estimator is given by $\boldsymbol{\beta}_{OLS} = \mathbf{X}^{\prime} \mathbf{y}$. Looking at this component-wise now we obtain

$$\beta_R = \frac{\beta_{OLS}}{1+\lambda} \tag{2}$$

Notice that the shrinkage is now constant for all coefficients. This may not hold in the general case; indeed, it can be shown that the shrinkages will differ widely if there are degeneracies in the $\mathbf{X}^{\prime} \mathbf{X}$ matrix.
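Equation (2) is easy to verify numerically. Here is a sketch (the data are arbitrary) that builds an exactly orthonormal design by taking the Q factor of a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
# Build a design with X'X = I by taking the Q factor of a QR decomposition.
X, _ = np.linalg.qr(rng.normal(size=(30, 3)))
y = rng.normal(size=30)
lam = 0.5

beta_ols = X.T @ y  # OLS estimator in the orthogonal case
beta_R = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Equation (2): every coefficient is shrunk by the same factor 1/(1 + lam).
print(beta_R, beta_ols / (1 + lam))
```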

But let's return to the constrained optimization problem. By the KKT theory, a necessary condition for optimality is

$$\lambda \left( \sum \beta_{R,j}^2 - s \right) = 0$$

so either $\lambda = 0$ or $\sum \beta_{R,j} ^2 -s = 0$ (in this case we say that the constraint is binding). If $\lambda = 0$ then there is no penalty and we are back in the regular OLS situation. Suppose then that the constraint is binding and we are in the second situation. Using the formula in (2), we then have

$$ s = \sum \beta_{R,j}^2 = \frac{1}{\left(1 + \lambda \right)^2} \sum \beta_{OLS,j}^2$$

whence we obtain

$$\lambda = \sqrt{\frac{\sum \beta_{OLS,j} ^2}{s}} - 1 $$

the one-to-one relationship previously claimed. I expect this is harder to establish in the non-orthogonal case, but the result carries over regardless.
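In the orthogonal case the correspondence can be checked directly (a sketch with simulated data): pick a binding $s$, compute $\lambda$ from the formula above, and the constraint holds with equality:

```python
import numpy as np

rng = np.random.default_rng(4)
X, _ = np.linalg.qr(rng.normal(size=(40, 3)))  # orthogonal design: X'X = I
y = rng.normal(size=40)

beta_ols = X.T @ y
s = 0.5 * np.sum(beta_ols ** 2)  # a binding constraint: s < ||beta_OLS||^2

# Lambda implied by the binding constraint, from the formula above.
lam = np.sqrt(np.sum(beta_ols ** 2) / s) - 1

beta_R = beta_ols / (1 + lam)  # equation (2)
print(np.sum(beta_R ** 2), s)  # the constraint holds with equality
```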

Look again at (2), though, and you'll see we are still missing the $\lambda$. To get an optimal value for it, you may either use cross-validation or look at the ridge trace. The latter method involves constructing a sequence of $\lambda$ values in $(0,1)$ and looking at how the estimates change. You then select the $\lambda$ that stabilizes them. By the way, this method was suggested in the second of the references below and is the oldest one.
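A bare-bones version of a ridge trace looks like this (a numpy sketch with made-up collinear data and without the usual plot; in practice you would plot the coefficient paths against $\lambda$ and eyeball where they stabilize):

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=80)
x2 = x1 + 0.01 * rng.normal(size=80)  # a collinear pair makes the trace interesting
X = np.column_stack([x1, x2])
y = x1 + 2 * x2 + rng.normal(size=80)

# Ridge trace: coefficient estimates over an increasing grid of lambda values.
lams = [0.0, 0.01, 0.1, 0.5, 1.0]
betas = [np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y) for lam in lams]
for lam, beta in zip(lams, betas):
    print(f"lambda={lam:4.2f}  beta={beta}")
```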

## References