Consider ridge regression with an additional constraint requiring that $\hat{\mathbf y}$ has unit sum of squares (equivalently, unit variance); if needed, one can assume that $\mathbf y$ has unit sum of squares as well:
$$\hat{\boldsymbol\beta}_\lambda^* = \arg\min\Big\{\|\mathbf y - \mathbf X \boldsymbol \beta\|^2+\lambda\|\boldsymbol\beta\|^2\Big\} \:\:\text{s.t.}\:\: \|\mathbf X \boldsymbol\beta\|^2=1.$$
What is the limit of $\hat{\boldsymbol\beta}_\lambda^*$ when $\lambda\to\infty$?
Here are some statements that I believe are true:
1. When $\lambda=0$, there is a neat explicit solution: take the OLS estimator $\hat{\boldsymbol\beta}_0=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$ and normalize it to satisfy the constraint (one can see this by adding a Lagrange multiplier and differentiating): $$\hat{\boldsymbol\beta}_0^* = \hat{\boldsymbol\beta}_0 \big/ \|\mathbf X\hat{\boldsymbol\beta}_0\|.$$
2. In general, the solution is $$\hat{\boldsymbol\beta}_\lambda^*=\big((1+\mu)\mathbf X^\top \mathbf X + \lambda \mathbf I\big)^{-1}\mathbf X^\top \mathbf y,$$ with $\mu$ chosen to satisfy the constraint. I don't see a closed-form solution when $\lambda>0$. It seems that the solution is equivalent to the usual RR estimator with some $\lambda^*$, normalized to satisfy the constraint, but I don't see a closed formula for $\lambda^*$.
3. When $\lambda\to \infty$, the usual RR estimator $$\hat{\boldsymbol\beta}_\lambda=(\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1}\mathbf X^\top \mathbf y$$ obviously converges to zero, but its direction $\hat{\boldsymbol\beta}_\lambda \big/ \|\hat{\boldsymbol\beta}_\lambda\|$ converges to the direction of $\mathbf X^\top \mathbf y$, a.k.a. the first partial least squares (PLS) component.
Statements (2) and (3) together make me think that perhaps $\hat{\boldsymbol\beta}_\lambda^*$ also converges to the appropriately normalized $\mathbf X^\top \mathbf y$, but I am not sure if this is correct and I have not managed to convince myself either way.
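Statement (1) is easy to sanity-check numerically: no other point satisfying the constraint should achieve a smaller residual sum of squares than the normalized OLS estimator. A minimal sketch with made-up data (all names and sizes are my own choices):

```python
import numpy as np

# Made-up data; names and dimensions are illustrative only.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Statement (1): the OLS fit, rescaled so that ||X beta|| = 1.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_star = beta_ols / np.linalg.norm(X @ beta_ols)

def rss(b):
    r = y - X @ b
    return r @ r

# No other feasible point (||X b|| = 1) should have a smaller RSS.
best = rss(beta_star)
for _ in range(2000):
    b = rng.standard_normal(p)
    b /= np.linalg.norm(X @ b)   # rescale onto the constraint set
    assert rss(b) >= best - 1e-9
```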
Best Answer
# A geometrical interpretation
The estimator described in the question is the Lagrange multiplier equivalent of the following optimization problem:
$$\text{minimize $f(\beta)$ subject to $g(\beta) \leq t$ and $h(\beta) = 1$ } $$
$$\begin{align} f(\beta) &= \lVert y-X\beta \rVert^2 \\ g(\beta) &= \lVert \beta \rVert^2\\ h(\beta) &= \lVert X\beta \rVert^2 \end{align}$$
which can be viewed, geometrically, as finding the smallest ellipsoid $f(\beta)=\text{RSS}$ that touches the intersection of the sphere $g(\beta) = t$ and the ellipsoid $h(\beta)=1$.
## Comparison to the standard ridge regression view
Geometrically, this changes the old picture for standard ridge regression, in which we look for the point where a spheroid (the errors) and a sphere ($\|\beta\|^2=t$) touch, into a new picture in which we look for the point where the spheroid (the errors) touches a curve (the norm of $\beta$ constrained by $\|X\beta\|^2=1$). The sphere (blue in the left image) turns into a lower-dimensional figure due to the intersection with the $\|X\beta\|^2=1$ constraint.
In the two-dimensional case this is simple to visualize.
When we tune the parameter $t$, we change the relative sizes of the blue/red spheres, i.e. the relative sizes of $f(\beta)$ and $g(\beta)$. (In the theory of Lagrange multipliers there is probably a neat way to show formally that $t$ as a function of $\lambda$, or the reverse, is a monotonic function. But I imagine you can see intuitively that the sum of squared residuals can only increase when we decrease $\|\beta\|$.)
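For the standard ridge path (without the $\|X\beta\|^2=1$ constraint), this monotone trade-off is easy to verify numerically. A quick sketch with made-up data:

```python
import numpy as np

# Made-up data; illustrates the monotone trade-off along the ridge path.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

norms, rsss = [], []
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    r = y - X @ b
    norms.append(b @ b)   # ||beta||^2 shrinks as lambda grows ...
    rsss.append(r @ r)    # ... while the RSS grows

assert all(a > b for a, b in zip(norms, norms[1:]))
assert all(a < b for a, b in zip(rsss, rsss[1:]))
```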
The solution $\beta_\lambda$ for $\lambda=0$ is, as you argued, on the line between $0$ and $\beta_{LS}$.
The solution $\beta_\lambda$ for $\lambda \to \infty$ is (indeed as you commented) in the loadings of the first principal component. This is the point where $\lVert \beta \rVert^2$ is smallest subject to $\lVert X\beta \rVert^2 = 1$. It is the point where the circle $\lVert \beta \rVert^2=t$ touches the ellipse $\lVert X\beta \rVert^2=1$ in a single point.
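This limiting direction can be checked numerically. The sketch below solves the stationarity form quoted in the question, $((1+\mu)\mathbf X^\top\mathbf X + \lambda\mathbf I)\beta = \mathbf X^\top\mathbf y$, by root-finding over $\mu$ (the bracket, the data, and all names are my own choices, not from the question):

```python
import numpy as np
from scipy.optimize import brentq

# Made-up data; the bracket for mu is my own construction.
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
S, Xty = X.T @ X, X.T @ y
evals, evecs = np.linalg.eigh(S)
v1 = evecs[:, -1]   # loadings of the first principal component

def beta_constrained(lam):
    # Stationarity: ((1 + mu) X'X + lam I) beta = X'y, with mu chosen
    # so that ||X beta||^2 = 1 (found by root-finding over mu).
    def gap(mu):
        b = np.linalg.solve((1 + mu) * S + lam * np.eye(p), Xty)
        Xb = X @ b
        return Xb @ Xb - 1.0
    mu_lo = -1.0 - lam / evals[-1] + 1e-8   # just above the singular point
    mu = brentq(gap, mu_lo, 1e8)
    return np.linalg.solve((1 + mu) * S + lam * np.eye(p), Xty)

b = beta_constrained(1e6)
cos = abs(b @ v1) / np.linalg.norm(b)   # essentially 1 for large lambda
```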
In this 2-d view the edges of the intersection of the sphere $\lVert \beta \rVert^2 = t$ and the spheroid $\lVert X\beta \rVert^2 = 1$ are points. In multiple dimensions these will be curves.
(I imagined at first that these curves would be ellipses, but they are more complicated. You could imagine the ellipsoid $\lVert X \beta \rVert^2 = 1$ being intersected by the ball $\lVert \beta \rVert^2 \leq t$ as some sort of ellipsoid frustum, but with edges that are not simple ellipses.)
## Regarding the limit $\lambda \to \infty$
At first (in previous edits) I wrote that there would be some limiting $\lambda_{\text{lim}}$ above which all the solutions are the same (and reside at the point $\beta^*_\infty$). But this is not the case.
Consider the optimization as a LARS algorithm or gradient descent. If at any point $\beta$ there is a direction in which we can change $\beta$ such that the penalty term $\lVert \beta \rVert^2$ increases less than the SSR term $\lVert y - X\beta \rVert^2$ decreases, then you are not at a minimum.
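This first-order reasoning can be checked numerically: at the constrained solution (computed here from the stationarity form quoted in the question; the data, $\lambda$, and the root-finding bracket are my own choices), no constraint-preserving perturbation should decrease the penalized objective. A sketch:

```python
import numpy as np
from scipy.optimize import brentq

# Made-up data; lam and the mu-bracket are my own choices.
rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
S, Xty = X.T @ X, X.T @ y
lam = 5.0

def objective(b):
    r = y - X @ b
    return r @ r + lam * (b @ b)   # SSR plus ridge penalty

# Constrained solution from the stationarity form in the question:
# ((1 + mu) X'X + lam I) beta = X'y, with mu fixing ||X beta||^2 = 1.
def gap(mu):
    b = np.linalg.solve((1 + mu) * S + lam * np.eye(p), Xty)
    Xb = X @ b
    return Xb @ Xb - 1.0

mu = brentq(gap, -1.0 - lam / np.linalg.eigvalsh(S)[-1] + 1e-8, 1e8)
b_star = np.linalg.solve((1 + mu) * S + lam * np.eye(p), Xty)

# At the minimum there is no feasible direction in which the SSR drops
# faster than the penalty rises: every constraint-preserving
# perturbation leaves the penalized objective at least as large.
base = objective(b_star)
for _ in range(2000):
    b = b_star + 0.1 * rng.standard_normal(p)
    b /= np.linalg.norm(X @ b)   # rescale back onto ||X b|| = 1
    assert objective(b) >= base - 1e-9
```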