Ridge Regression – How to Derive Ridge Regression Solution

least squares, regression, regularization, ridge regression

I am having some issues with the derivation of the solution for ridge regression.

I know the regression solution without the regularization term:

$$\beta = (X^TX)^{-1}X^Ty.$$

But after adding the L2 penalty $\lambda\|\beta\|_2^2$ to the cost function, why does the solution become

$$\beta = (X^TX + \lambda I)^{-1}X^Ty?$$

Best Answer

It suffices to add the penalty to the loss function. In matrix terms, the quadratic loss becomes $$ (Y - X\beta)^{T}(Y-X\beta) + \lambda \beta^T\beta.$$ Differentiating with respect to $\beta$ and setting the gradient to zero gives $$ -2X^{T}(Y - X\beta) + 2\lambda\beta = 0,$$ which rearranges into the normal equation $$ X^{T}Y = \left(X^{T}X + \lambda I\right)\beta. $$ Solving for $\beta$ yields the ridge estimator $\beta = (X^TX + \lambda I)^{-1}X^TY$. Note that for $\lambda > 0$ the matrix $X^TX + \lambda I$ is always invertible (it is positive definite), even when $X^TX$ itself is singular.
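A quick numerical sanity check of the derivation, sketched with NumPy on synthetic data (the data, $\lambda = 0.7$, and the variable names are illustrative assumptions): solve the ridge normal equation and verify that the gradient of the penalized loss vanishes at the solution.

```python
import numpy as np

# Synthetic regression problem (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

lam = 0.7  # ridge penalty strength, an arbitrary positive choice
p = X.shape[1]

# Ridge estimator: solve (X^T X + lam * I) beta = X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Gradient of (y - X beta)^T (y - X beta) + lam * beta^T beta
# at beta_ridge; it should be zero up to floating-point error.
grad = -2 * X.T @ (y - X @ beta_ridge) + 2 * lam * beta_ridge
print(np.max(np.abs(grad)))
```

Because `beta_ridge` satisfies the normal equation exactly (up to floating point), the printed maximum gradient component is on the order of machine precision.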
