Ridge regression formulation as constrained versus penalized: how are they equivalent?

Tags: regression, ridge regression

I seem to be misunderstanding a claim about linear regression methods that
I've seen in various places. The parameters of the problem are:

Input:

$N$ data samples, each consisting of $p+1$ quantities: a "response" quantity $y_i$ and $p$
"predictor" quantities $x_{ij}$

The desired result is a "good linear fit" which predicts the response from the
predictors, where a good fit has small differences between the predictions and the
observed responses (among other criteria).

Output: $p+1$ coefficients $\beta_j$ where $\beta_0 + \sum_{j=1}^p x_{ij} \beta_j$ is a "good fit"
for predicting the response quantity from the predictor quantities.

I'm confused about the
"ridge regression" approach to this problem.
In "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman page 63
ridge regression is formulated in two ways.

First as the constrained optimization problem:

$$
\arg \min_{\beta} \sum_{i=1}^N \left( y_i - \Big( \beta_0 + \sum_{j=1}^p x_{ij} \beta_j \Big) \right)^2
$$
subject to the constraint
$$
\sum_{j=1}^p \beta_j^2 \leq t
$$
for some positive parameter $t$.

Second is the penalized optimization problem:
$$
\arg \min_{\beta} \; \lambda \sum_{j=1}^p \beta_j^2 + \sum_{i=1}^N \left( y_i - \Big( \beta_0 + \sum_{j=1}^p x_{ij} \beta_j \Big) \right)^2
$$
for some positive parameter $\lambda$.

The text says that these formulations are equivalent and
that there is a "one to one correspondence between the parameters $\lambda$ and $t$".
I've seen this claim (and similar ones) in several places in addition to this book.
I think I am missing something, because as I understand them the two formulations
do not appear to be equivalent.
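
To make the question concrete, here is how I read the two formulations as numerical optimization problems (a rough sketch with made-up data; the values $\lambda = 5$ and $t = 1$ are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))                      # made-up predictors
y = 0.3 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)

def rss(beta):
    # beta[0] is the intercept beta_0, beta[1:] are beta_1 .. beta_p
    return np.sum((y - (beta[0] + X @ beta[1:])) ** 2)

# Penalized formulation with an arbitrary lambda
lam = 5.0
beta_pen = minimize(lambda b: rss(b) + lam * np.sum(b[1:] ** 2),
                    x0=np.zeros(p + 1)).x

# Constrained formulation with an arbitrary t
t = 1.0
beta_con = minimize(rss, x0=np.zeros(p + 1), method="SLSQP",
                    constraints=[{"type": "ineq",
                                  "fun": lambda b: t - np.sum(b[1:] ** 2)}]).x

print(beta_pen)
print(beta_con)
```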

Consider the case where $N=2$ and $p=1$ with $y_1=0$, $x_{1,1}=0$ and $y_2=1$, $x_{2,1}=1$. Choosing the parameter
$t=2$, the constrained formulation becomes:

$$
\arg \min_{\beta_0,\beta_1} \left(
\beta_0^2 +
\big(1 - (\beta_0 + \beta_1)\big)^2
\right)
$$

expanded to

$$
\arg \min_{\beta_0,\beta_1} \left(
2 \beta_{0}^{2} + 2 \beta_{0} \beta_{1} - 2 \beta_{0} + \beta_{1}^{2} - 2 \beta_{1} + 1
\right)
$$

To solve this, find the point where the partial derivatives with respect to $\beta_0$ and $\beta_1$
are zero:
$$ 4 \beta_{0} + 2 \beta_{1} - 2 = 0 $$
$$ 2 \beta_{0} + 2 \beta_{1} - 2 = 0 $$
with solution $\beta_0 = 0$ and $\beta_1 = 1$. Note that $\beta_0^2 + \beta_1^2 \le t$ as required.
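
As a quick sanity check on this algebra, here is a small sympy sketch (`b0` and `b1` are just stand-ins for $\beta_0$ and $\beta_1$):

```python
import sympy as sp

b0, b1 = sp.symbols("b0 b1")
# Objective for the example data: beta_0^2 + (1 - (beta_0 + beta_1))^2
obj = b0**2 + (1 - (b0 + b1))**2
sol = sp.solve([sp.diff(obj, b0), sp.diff(obj, b1)], [b0, b1])
print(sol)  # {b0: 0, b1: 1}
```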

How does this derivation relate to the other formulation?
According to the explanation there is some value of $\lambda$, uniquely corresponding
to $t$, for which optimizing the penalized formulation yields
the same $\beta_0$ and $\beta_1$. In this case the penalized form becomes
$$
{argmin}_{\beta_0,\beta_1} (
\lambda (\beta_0^2 + \beta_1^2) +
\beta_0^2 +
(1 - (\beta_0 + \beta_1))^2
)
$$
expanded to
$$
{argmin}_{\beta_0,\beta_1} (
\beta_{0}^{2} \lambda + 2 \beta_{0}^{2} + 2 \beta_{0} \beta_{1} - 2 \beta_{0} + \beta_{1}^{2} \lambda + \beta_{1}^{2} - 2 \beta_{1} + 1
)
$$
To solve this, find the point where the partial derivatives with respect to $\beta_0$ and $\beta_1$
are zero:
$$ 2 \beta_{0} \lambda + 4 \beta_{0} + 2 \beta_{1} - 2 = 0 $$
$$ 2 \beta_{0} + 2 \beta_{1} \lambda + 2 \beta_{1} - 2 = 0 $$
Solving these equations I get
$$ \beta_0 = \lambda/(\lambda^2 + 3\lambda + 1) $$
$$ \beta_1 = (\lambda + 1)/((\lambda + 1)(\lambda + 2) – 1) $$
If that is correct, the only way to get $\beta_0 = 0$ is to set $\lambda = 0$. However, that is
also the $\lambda$ we would need for $t = 4$, so what do they mean by "one to one
correspondence"?

In summary, I'm totally confused by the two presentations and I don't understand how they
correspond to each other. I don't understand how you can optimize one form and get the same
solution for the other form, or how $\lambda$ is related to $t$. This is just one instance
of this kind of correspondence (there are others for other approaches, such as the lasso),
and I don't understand any of them.

Someone please help me.

Best Answer

The classic Ridge Regression (Tikhonov Regularization) problem, written here in its simplest form (identity design matrix), is given by:

$$ \arg \min_{x} \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} + \lambda {\left\| x \right\|}_{2}^{2} $$

The claim above is that the following problem is equivalent:

$$\begin{align*} \arg \min_{x} \quad & \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} \\ \text{subject to} \quad & {\left\| x \right\|}_{2}^{2} \leq t \end{align*}$$

Let's define $ \hat{x} $ as the optimal solution of the first problem and $ \tilde{x} $ as the optimal solution of the second problem.

The claim of equivalence means that $ \forall t, \: \exists \lambda \geq 0 : \hat{x} = \tilde{x} $.
Namely, you can always find a pair of $ t $ and $ \lambda \geq 0 $ such that the solutions of the two problems are the same.

How could we find such a pair?
Well, by solving the problems and looking at the properties of the solutions.
Both problems are convex and smooth, which makes things simpler.

The solution of the first problem is given at the point where the gradient vanishes, which means:

$$ \hat{x} - y + 2 \lambda \hat{x} = 0 $$
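
In this simplified setting the equation can be solved explicitly:

$$ \hat{x} = \frac{y}{1 + 2 \lambda} $$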

The KKT conditions of the second problem state:

$$ \tilde{x} - y + 2 \mu \tilde{x} = 0 $$

and

$$ \mu \left( {\left\| \tilde{x} \right\|}_{2}^{2} - t \right) = 0 $$

The last equation suggests that either $ \mu = 0 $ or $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $.
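
The stationarity equation likewise gives $ \tilde{x} = \frac{y}{1 + 2 \mu} $, so the condition $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $ becomes a scalar equation in $ \mu $ alone.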

Notice that the two stationarity equations have the same form.
Namely, if $ \hat{x} = \tilde{x} $ and $ \mu = \lambda $, both equations hold.

So in the case $ {\left\| y \right\|}_{2}^{2} \leq t $ the unconstrained minimizer $ \tilde{x} = y $ is already feasible, which forces $ \mu = 0 $; in other words, for $ t $ large enough the two formulations can only match with $ \lambda = 0 $.

In the other case one should find $ \mu $ such that:

$$ {y}^{T} {\left( I + 2 \mu I \right)}^{-2} y = t $$

This is just the condition $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $.

Once you find that $ \mu $, set $ \lambda = \mu $ and the two solutions coincide.
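
For this simplified problem the whole correspondence can be verified in a few lines (a small Python sketch; the vector $ y $ and the level $ t $ are arbitrary, chosen so that the constraint is active):

```python
import numpy as np

y = np.array([1.0, -2.0, 0.5])   # arbitrary example vector, ||y||^2 = 5.25
t = 0.5                          # constraint level with ||y||^2 > t, so the constraint binds

# Active constraint: solve ||y||^2 / (1 + 2*mu)^2 = t for mu
mu = 0.5 * (np.linalg.norm(y) / np.sqrt(t) - 1.0)

# The penalized problem with lambda = mu has the closed-form solution y / (1 + 2*lambda),
# which therefore also solves the constrained problem
x = y / (1.0 + 2.0 * mu)

print(np.sum(x ** 2))            # 0.5, i.e. the constraint holds with equality
```

Going the other way, given $ \lambda $ the matching constraint level is simply $ t = {\left\| y \right\|}_{2}^{2} / {\left( 1 + 2 \lambda \right)}^{2} $.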

Regarding the $ {L}_{1} $ (lasso) case, the same idea works.
The only difference is that there is no closed-form solution, hence deriving the connection between $ \lambda $ and $ t $ is trickier.

Have a look at my answer at StackExchange Cross Validated Q291962 and StackExchange Signal Processing Q21730 - Significance of $ \lambda $ in Basis Pursuit.