Solved – The limit of “unit-variance” ridge regression estimator when $\lambda\to\infty$

constrained regression, partial least squares, pca, regularization, ridge regression

Consider ridge regression with an additional constraint requiring that $\hat{\mathbf y}$ has unit sum of squares (equivalently, unit variance); if needed, one can assume that $\mathbf y$ has unit sum of squares as well:

$$\hat{\boldsymbol\beta}_\lambda^* = \arg\min\Big\{\|\mathbf y - \mathbf X \boldsymbol \beta\|^2+\lambda\|\boldsymbol\beta\|^2\Big\} \:\:\text{s.t.}\:\: \|\mathbf X \boldsymbol\beta\|^2=1.$$

What is the limit of $\hat{\boldsymbol\beta}_\lambda^*$ when $\lambda\to\infty$?


Here are some statements that I believe are true:

  1. When $\lambda=0$, there is a neat explicit solution: take the OLS estimator $\hat{\boldsymbol\beta}_0=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$ and normalize it to satisfy the constraint (one can see this by adding a Lagrange multiplier and differentiating; a short derivation is written out after this list):
    $$\hat{\boldsymbol\beta}_0^* = \hat{\boldsymbol\beta}_0 \big/ \|\mathbf X\hat{\boldsymbol\beta}_0\|.$$

  2. In general, the solution is $$\hat{\boldsymbol\beta}_\lambda^*=\big((1+\mu)\mathbf X^\top \mathbf X + \lambda \mathbf I\big)^{-1}\mathbf X^\top \mathbf y\:\:\text{with $\mu$ chosen to satisfy the constraint}.$$ I don't see a closed-form solution when $\lambda >0$. It seems that the solution is equivalent to the usual RR estimator with some $\lambda^*$, normalized to satisfy the constraint, but I don't see a closed-form expression for $\lambda^*$.

  3. When $\lambda\to \infty$, the usual RR estimator $$\hat{\boldsymbol\beta}_\lambda=(\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1}\mathbf X^\top \mathbf y$$ obviously converges to zero, but its direction $\hat{\boldsymbol\beta}_\lambda \big/ \|\hat{\boldsymbol\beta}_\lambda\|$ converges to the direction of $\mathbf X^\top \mathbf y$, a.k.a. the first partial least squares (PLS) component.
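
For reference, here is the Lagrangian computation behind (1) and (2), written out only to fix notation ($\mu$ denotes the multiplier of the constraint):

$$\begin{align} \mathcal L(\boldsymbol\beta,\mu) &= \|\mathbf y - \mathbf X \boldsymbol\beta\|^2 + \lambda\|\boldsymbol\beta\|^2 + \mu\big(\|\mathbf X \boldsymbol\beta\|^2 - 1\big), \\ \tfrac12 \nabla_{\boldsymbol\beta}\mathcal L = 0 \;&\Longleftrightarrow\; \big((1+\mu)\mathbf X^\top \mathbf X + \lambda \mathbf I\big)\boldsymbol\beta = \mathbf X^\top \mathbf y, \end{align}$$

which is the form in (2); for $\lambda=0$ it reduces to $\boldsymbol\beta = (1+\mu)^{-1}(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$, i.e. a rescaled $\hat{\boldsymbol\beta}_0$, with the scale fixed by the constraint $\|\mathbf X\boldsymbol\beta\|^2=1$, as in (1).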

Statements (2) and (3) together make me think that perhaps $\hat{\boldsymbol\beta}_\lambda^*$ also converges to the appropriately normalized $\mathbf X^\top \mathbf y$, but I am not sure if this is correct and I have not managed to convince myself either way.
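
As a rough numerical sanity check (my own sketch, not part of the original question; the toy data, the value $\lambda = 10^6$, and the use of SciPy's SLSQP solver are arbitrary choices), one can solve the constrained problem for a single large $\lambda$ and compare the direction of $\hat{\boldsymbol\beta}_\lambda^*$ with both $\mathbf X^\top \mathbf y$ and the first principal component:

```python
# Rough numerical probe of the conjecture: solve the constrained ridge problem
# for one large lambda with a generic constrained optimizer and compare the
# direction of the solution with X^T y (first PLS component) and with the
# leading eigenvector of X^T X (first principal component).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p)) @ np.diag([3.0, 2.0, 1.5, 1.0, 0.5])  # unequal column scales
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
y /= np.linalg.norm(y)           # unit sum of squares for y, as assumed in the question

lam = 1e6                        # "large" lambda (arbitrary choice)

def objective(beta):             # ||y - X beta||^2 + lambda ||beta||^2
    r = y - X @ beta
    return r @ r + lam * (beta @ beta)

def objective_grad(beta):
    return 2.0 * X.T @ (X @ beta - y) + 2.0 * lam * beta

constraint = {"type": "eq",      # ||X beta||^2 = 1
              "fun": lambda beta: beta @ X.T @ X @ beta - 1.0,
              "jac": lambda beta: 2.0 * X.T @ X @ beta}

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_start = beta_ols / np.linalg.norm(X @ beta_ols)   # feasible starting point

res = minimize(objective, beta_start, jac=objective_grad,
               constraints=[constraint], method="SLSQP",
               options={"maxiter": 2000, "ftol": 1e-12})
beta_hat = res.x

def abs_cosine(a, b):
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

pls1 = X.T @ y                                   # direction of the first PLS component
pca1 = np.linalg.eigh(X.T @ X)[1][:, -1]         # leading eigenvector of X^T X

print("|cos(beta_hat, X^T y)| =", abs_cosine(beta_hat, pls1))
print("|cos(beta_hat, PC1)|   =", abs_cosine(beta_hat, pca1))
```

Repeating this for a grid of increasing $\lambda$ values shows which of the two cosines tends to $1$.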

Best Answer

# A geometrical interpretation

The estimator described in the question is the Lagrange multiplier equivalent of the following optimization problem:

$$\text{minimize $f(\beta)$ subject to $g(\beta) \leq t$ and $h(\beta) = 1$ } $$

$$\begin{align} f(\beta) &= \lVert y-X\beta \lVert^2 \\ g(\beta) &= \lVert \beta \lVert^2\\ h(\beta) &= \lVert X\beta \lVert^2 \end{align}$$

which can be viewed, geometrically, as finding the smallest ellipsoid $f(\beta)=\text{RSS}$ that touches the intersection of the sphere $g(\beta) = t$ and the ellipsoid $h(\beta)=1$.


## Comparison to the standard ridge regression view

In terms of the geometrical view, this changes the old picture (for standard ridge regression) of the point where a spheroid (the errors) and a sphere ($\lVert\beta\rVert^2=t$) touch, into a new picture in which we look for the point where the spheroid (the errors) touches a curve (the sphere $\lVert\beta\rVert^2=t$ restricted to the constraint $\lVert X\beta\rVert^2=1$). The single sphere (blue in the left image) becomes a lower-dimensional figure because of the intersection with the $\lVert X\beta\rVert^2=1$ constraint.

In the two dimensional case this is simple to view.

[Figure: geometric view]

When we tune the parameter $t$, we change the relative sizes of the blue/red spheres, i.e. the relative sizes of $f(\beta)$ and $g(\beta)$. (In the theory of Lagrange multipliers there is probably a neat way to describe formally and exactly that $t$ as a function of $\lambda$, or the reverse, is a monotonic function; but I imagine you can see intuitively that the sum of squared residuals can only increase when we decrease $\lVert\beta\rVert$.)

The solution $\beta_\lambda$ for $\lambda=0$ lies, as you argued, on the line through $0$ and $\beta_{LS}$.

The solution $\beta_\lambda$ for $\lambda \to \infty$ is (indeed, as you commented) given by the loadings of the first principal component. This is the point where $\lVert \beta \rVert^2$ is smallest subject to $\lVert X\beta \rVert^2 = 1$. It is the point where the circle $\lVert \beta \rVert^2=t$ touches the ellipse $\lVert X\beta\rVert^2=1$ in a single point.
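
To spell out why this point is the first principal component (a standard eigenvector argument, written out here for completeness): minimizing $\lVert \beta \rVert^2$ subject to $\lVert X\beta \rVert^2 = 1$ with a multiplier $\nu$ gives

$$\nabla_\beta \Big( \lVert \beta \rVert^2 - \nu \big( \lVert X\beta \rVert^2 - 1 \big) \Big) = 0 \quad\Longleftrightarrow\quad X^T X \beta = \tfrac{1}{\nu}\, \beta ,$$

so $\beta$ must be an eigenvector of $X^T X$. If the eigenvalue is $\sigma$, the constraint gives $\sigma \lVert \beta \rVert^2 = 1$, i.e. $\lVert \beta \rVert^2 = 1/\sigma$, which is smallest for the largest eigenvalue. Hence $\beta^*_\infty$ is the leading eigenvector of $X^T X$, scaled so that $\lVert X \beta^*_\infty \rVert = 1$.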

In this 2-d view the edges of the intersection of the sphere $\lVert \beta \rVert^2 = t$ and the spheroid $\lVert X\beta \rVert^2 = 1$ are points. In more than two dimensions these will be curves.

(I imagined at first that these curves would be ellipses, but they are more complicated. You could imagine the ellipsoid $\lVert X \beta \rVert^2 = 1$ being intersected by the ball $\lVert \beta \rVert^2 \leq t$ as some sort of ellipsoid frustum, but with edges that are not simple ellipses.)


## Regarding the limit $\lambda \to \infty$

At first (in previous edits) I wrote that there would be some limiting $\lambda_{lim}$ above which all the solutions are the same (namely the point $\beta^*_\infty$). But this is not the case.

Consider the optimization as a LARS algorithm or gradient descent: if at some point $\beta$ there is a direction in which we can change $\beta$ such that the penalty term $\lambda\lVert\beta\rVert^2$ increases less than the SSR term $\lVert y-X\beta\rVert^2$ decreases, then you are not at a minimum.

  • In normal ridge regression the penalty $\lVert\beta\rVert^2$ has zero slope (in all directions) at the point $\beta=0$. So for any finite $\lambda$ the solution cannot be $\beta = 0$, since an infinitesimal step can be made that reduces the sum of squared residuals without increasing the penalty (to first order).
  • For LASSO this is not the case, since the penalty is $\lVert \beta \rVert_1$, which is not quadratic with zero slope. Because of that, LASSO does have some limiting value $\lambda_{lim}$ above which all the solutions are zero, because then the penalty term (multiplied by $\lambda$) increases more than the residual sum of squares decreases.
  • For the constrained ridge you get the same behaviour as in regular ridge regression. If you change $\beta$ starting from $\beta^*_\infty$ while staying on the constraint surface, then this change is perpendicular to $\beta^*_\infty$ (which is normal to the surface of the ellipsoid $\lVert X\beta\rVert^2=1$ at that point), so $\beta$ can be changed by an infinitesimal step without changing the penalty term (to first order) while decreasing the sum of squared residuals. Thus for any finite $\lambda$ the point $\beta^*_\infty$ cannot be the solution; this argument is spelled out right after this list.
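
To make the last bullet precise (my own rewording of the same perpendicularity argument): write $\sigma_{max}$ for the largest eigenvalue of $X^T X$, so that $X^T X \beta^*_\infty = \sigma_{max}\,\beta^*_\infty$. For a direction $\delta$ tangent to the constraint surface at $\beta^*_\infty$,

$$\begin{align} \delta^T X^T X \beta^*_\infty = 0 \;&\Longrightarrow\; \delta^T \beta^*_\infty = 0 \quad \text{(the penalty $\lVert\beta\rVert^2$ is unchanged to first order),} \\ \tfrac12\,\delta^T\,\nabla_\beta \lVert y - X\beta \rVert^2 \Big|_{\beta^*_\infty} \;&=\; \delta^T\big(X^T X \beta^*_\infty - X^T y\big) \;=\; -\,\delta^T X^T y , \end{align}$$

which is nonzero for some tangent direction $\delta$ unless $X^T y$ happens to be proportional to $\beta^*_\infty$. So for every finite $\lambda$ there is a feasible first-order descent direction at $\beta^*_\infty$, and $\beta^*_\infty$ is not the solution.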

## Further notes regarding the limit $\lambda \to \infty$

The usual ridge regression limit as $\lambda \to \infty$ corresponds to a different point of the constrained ridge regression. This 'old' limit corresponds to the point where $\mu$ is equal to $-1$: there, setting the derivative of the Lagrange function of the normalized problem to zero,

$$2 (1+\mu) X^{T}X \beta - 2 X^T y + 2 \lambda \beta = 0,$$

corresponds to setting to zero the derivative of the Lagrange function of the standard problem,

$$2 X^{T}X \beta^\prime - 2 X^T y + 2 \frac{\lambda}{(1+\mu)} \beta^\prime = 0 \qquad \text{with $\beta^\prime = (1+\mu)\beta$.}$$