Solved – Understanding negative ridge regression

Tags: regression, regularization, ridge regression

I'm looking for literature about negative ridge regression.

In short, it is a generalization of linear ridge regression that uses a negative $\lambda$ in the estimator formula: $$\hat\beta = ( X^\top X + \lambda I)^{-1} X^\top y.$$ The positive case has a nice theory: it can be viewed as a loss function, as a constraint, as a Bayesian prior… but with only the formula above I feel lost in the negative case. It happens to be useful for what I am doing, but I fail to interpret it clearly.

Do you know any serious introductory text about negative ridge? How can it be interpreted?

Best Answer

Here is a geometric illustration of what is going on with negative ridge.

I will consider estimators of the form $$\hat{\boldsymbol\beta}_\lambda = (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1}\mathbf X^\top\mathbf y$$ arising from the loss function $$\mathcal L_\lambda = \|\mathbf y - \mathbf X\boldsymbol\beta\|^2 + \lambda \|\boldsymbol\beta\|^2.$$ Here is a rather standard illustration of what happens in a two-dimensional case with $\lambda\in[0,\infty)$. Zero lambda corresponds to the OLS solution; infinite lambda shrinks the estimated beta to zero:

[Figure: the standard ridge picture for $\lambda\in[0,\infty)$: circles of constant $\|\boldsymbol\beta\|$ touching the RSS ellipses, with the path running from the OLS solution down to zero.]
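For concreteness, here is a minimal numpy sketch of the closed-form estimator and of the shrinkage toward zero as $\lambda$ grows (the simulated data and all variable names are my own, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-predictor data, centered so that the PCs of X are meaningful.
n = 100
X = rng.normal(size=(n, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=n)
y -= y.mean()

def ridge(X, y, lam):
    """Closed-form estimator (X^T X + lam I)^{-1} X^T y; works for any lam
    that keeps the matrix invertible, including negative values."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = ridge(X, y, 0.0)               # lambda = 0 is plain OLS
for lam in [0.0, 1.0, 10.0, 1e3, 1e6]:
    b = ridge(X, y, lam)
    print(f"lambda={lam:>9}: beta={np.round(b, 4)}, norm={np.linalg.norm(b):.4f}")
    # the norm decreases toward zero as lambda grows
```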

Now consider what happens when $\lambda\in(-\infty, -s^2_\mathrm{max})$, where $s_\mathrm{max}$ is the largest singular value of $\mathbf X$. For very large negative lambdas, $\hat{\boldsymbol\beta}_\lambda$ is of course close to zero. As lambda approaches $-s^2_\mathrm{max}$, one eigenvalue of $(\mathbf X^\top \mathbf X + \lambda \mathbf I)$ approaches zero, so the corresponding eigenvalue of the inverse goes to minus infinity. That eigenvalue belongs to the direction of the first principal component of $\mathbf X$, so in the limit $\hat{\boldsymbol\beta}_\lambda$ points in the PC1 direction with its norm growing to infinity.

What is really nice is that one can draw this on the same figure in the same way: the betas are given by the points where the circles touch the ellipses from the inside:

[Figure: the same construction for $\lambda<-s^2_\mathrm{max}$: circles touching the ellipses from the inside, with the path heading toward infinity along the PC1 direction.]
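The $\lambda\to-s^2_\mathrm{max}$ limit is easy to check numerically by continuing the sketch above (still just an illustration on the simulated data):

```python
# Continuing the snippet above: approach lambda -> -s_max^2 from below.
U, sv, Vt = np.linalg.svd(X, full_matrices=False)
s2_max = sv[0] ** 2
v1 = Vt[0]                                 # first right singular vector = PC1 direction

for eps in [0.5, 1e-2, 1e-4]:
    lam = -s2_max * (1 + eps)              # slightly below -s_max^2
    b = ridge(X, y, lam)
    cos = abs(b @ v1) / np.linalg.norm(b)
    print(f"lambda={lam:.2f}: norm={np.linalg.norm(b):.1f}, |cos(b, PC1)|={cos:.6f}")
    # the norm blows up while the direction converges to PC1
```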

When $\lambda\in(-s^2_\mathrm{min},0]$, similar logic applies and the ridge path continues on the other side of the OLS estimator. Now the circles touch the ellipses from the outside. In the limit, the betas approach the PC2 direction (but this happens far outside this sketch):

[Figure: the construction for $\lambda\in(-s^2_\mathrm{min},0]$: circles touching the ellipses from the outside, with the path continuing past the OLS solution toward the PC2 direction.]
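The same kind of check works for this branch (again continuing the sketch above): for negative $\lambda$ just above $-s^2_\mathrm{min}$ the estimator has a larger norm than OLS, and its direction turns toward PC2:

```python
# Continuing the snippet above: the branch lambda in (-s_min^2, 0).
s2_min = sv[-1] ** 2
v2 = Vt[-1]                                # last right singular vector = PC2 direction

for eps in [0.5, 1e-2, 1e-4]:
    lam = -s2_min * (1 - eps)              # negative, slightly above -s_min^2
    b = ridge(X, y, lam)
    cos = abs(b @ v2) / np.linalg.norm(b)
    print(f"lambda={lam:.2f}: norm={np.linalg.norm(b):.1f} "
          f"(OLS norm {np.linalg.norm(beta_ols):.1f}), |cos(b, PC2)|={cos:.6f}")
    # the path moves past the OLS solution and turns toward PC2
```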

The $(-s^2_\mathrm{max}, -s^2_\mathrm{min})$ range is something of an energy gap: estimators there do not live on the same curve.

UPDATE: In the comments @MartinL explains that for $\lambda<-s^2_\mathrm{max}$ the loss $\mathcal L_\lambda$ does not have a minimum but has a maximum, and this maximum is attained at $\hat{\boldsymbol\beta}_\lambda$. This is why the same geometric construction with the circle/ellipse touching keeps working: we are still looking for zero-gradient points. When $-s^2_\mathrm{min}<\lambda\le 0$, the loss $\mathcal L_\lambda$ does have a minimum and it is attained at $\hat{\boldsymbol\beta}_\lambda$, exactly as in the normal $\lambda>0$ case.

But when $-s^2_\mathrm{max}<\lambda<-s^2_\mathrm{min}$, the loss $\mathcal L_\lambda$ has neither a maximum nor a minimum; $\hat{\boldsymbol\beta}_\lambda$ corresponds to a saddle point. This explains the "energy gap".
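This classification is easy to verify directly, since the Hessian of $\mathcal L_\lambda$ is $2(\mathbf X^\top \mathbf X + \lambda \mathbf I)$; a small check continuing the sketch above:

```python
# Continuing the snippet above: classify the stationary point of L_lambda
# via the eigenvalues of the Hessian 2 * (X^T X + lam I).
def classify(X, lam):
    eig = np.linalg.eigvalsh(X.T @ X + lam * np.eye(X.shape[1]))
    if np.all(eig > 0):
        return "minimum"
    if np.all(eig < 0):
        return "maximum"
    return "saddle point"

print(classify(X, -0.5 * s2_min))              # minimum:  lambda in (-s_min^2, 0]
print(classify(X, -0.5 * (s2_min + s2_max)))   # saddle:   lambda in the "energy gap"
print(classify(X, -2.0 * s2_max))              # maximum:  lambda < -s_max^2
```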


The $\lambda\in(-\infty, -s^2_\mathrm{max})$ case naturally arises from a particular constrained ridge regression; see The limit of "unit-variance" ridge regression estimator when $\lambda\to\infty$. This is related to what is known in the chemometrics literature as "continuum regression"; see my answer in the linked thread.

The $\lambda\in(-s^2_\mathrm{min},0]$ case can be treated in exactly the same way as $\lambda>0$: the loss function stays the same, and the ridge estimator provides its minimum.