Ridge Regression – Proof of Equivalent Formulas and Lagrange Multipliers

lagrange-multipliers, lasso, regularization, ridge-regression

I have read the most popular books on statistical learning:

1- The Elements of Statistical Learning.

2- An Introduction to Statistical Learning.

Both mention that ridge regression has two formulas that are equivalent.
Is there an understandable mathematical proof of this result?

I also went through Cross Validated, but I cannot find a definitive proof there.

Furthermore, does the LASSO admit the same type of proof?


Best Answer

The classic Ridge Regression (Tikhonov Regularization) is given by:

$$ \arg \min_{x} \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} + \lambda {\left\| x \right\|}_{2}^{2} $$

The claim above is that the following problem is equivalent:

$$\begin{align*} \arg \min_{x} \quad & \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} \\ \text{subject to} \quad & {\left\| x \right\|}_{2}^{2} \leq t \end{align*}$$

Let's define $ \hat{x} $ as the optimal solution of the first problem and $ \tilde{x} $ as the optimal solution of the second problem.

The claim of equivalence means that $ \forall t, \: \exists \lambda \geq 0 : \hat{x} = \tilde{x} $.
Namely, for every $ t $ you can find a $ \lambda \geq 0 $ such that both problems have the same solution.

How could we find a pair?
Well, by solving the problems and looking at the properties of the solution.
Both problems are convex and smooth, which makes things simpler.

The solution of the first problem is given at the point where the gradient vanishes, which means:

$$ \hat{x} - y + 2 \lambda \hat{x} = 0 $$
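Solving this explicitly gives the closed form

$$ \hat{x} = {\left( I + 2 \lambda I \right)}^{-1} y = \frac{1}{1 + 2 \lambda} y $$

which will be useful below.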

The KKT conditions of the second problem state:

$$ \tilde{x} - y + 2 \mu \tilde{x} = 0 $$

and

$$ \mu \left( {\left\| \tilde{x} \right\|}_{2}^{2} - t \right) = 0 $$

The last equation (complementary slackness) implies that either $ \mu = 0 $ or $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $.
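From the stationarity equation, $ \tilde{x} $ has the same closed form as before, with $ \mu $ in place of $ \lambda $:

$$ \tilde{x} = {\left( I + 2 \mu I \right)}^{-1} y = \frac{1}{1 + 2 \mu} y $$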

Pay attention that the two stationarity equations have exactly the same form.
Namely, if $ \mu = \lambda $ then $ \hat{x} = \tilde{x} $ satisfies both.

So in the case $ {\left\| y \right\|}_{2}^{2} \leq t $ the unconstrained minimizer $ \tilde{x} = y $ is already feasible, hence one must set $ \mu = 0 $; in other words, for $ t $ large enough the two problems are equivalent only with $ \lambda = 0 $.

In the other case, $ {\left\| y \right\|}_{2}^{2} > t $, one should find $ \mu > 0 $ such that:

$$ {y}^{T} {\left( I + 2 \mu I \right)}^{-1} {\left( I + 2 \mu I \right)}^{-1} y = t $$

This is exactly the condition $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $.
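Since $ {\left( I + 2 \mu I \right)}^{-1} y = \frac{1}{1 + 2 \mu} y $, this reduces to a scalar equation with an explicit solution:

$$ \frac{{\left\| y \right\|}_{2}^{2}}{{\left( 1 + 2 \mu \right)}^{2}} = t \quad \Longrightarrow \quad \mu = \frac{1}{2} \left( \frac{{\left\| y \right\|}_{2}}{\sqrt{t}} - 1 \right) \geq 0 $$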

Once you find that $ \mu $, setting $ \lambda = \mu $ makes the two solutions coincide.
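As a quick sanity check, here is a minimal numerical sketch (my own, using NumPy; the variable names are not from the answer) verifying that pairing:

```python
import numpy as np

# Minimal check of the ridge pairing derived above.
y = np.array([1.0, -2.0, 3.0])
t = 2.0  # chosen so that ||y||_2^2 > t, i.e. the constraint is active

# mu solving ||x(mu)||_2^2 = t, using x(mu) = y / (1 + 2 mu)
mu = 0.5 * (np.linalg.norm(y) / np.sqrt(t) - 1.0)

# Penalized (Tikhonov) solution with lambda = mu
x_penalized = y / (1.0 + 2.0 * mu)

# Constrained solution: projection of y onto the L2 ball of radius sqrt(t)
x_constrained = np.sqrt(t) * y / np.linalg.norm(y)

print(np.allclose(x_penalized, x_constrained))   # True
print(np.isclose(np.sum(x_penalized ** 2), t))   # True: the constraint is tight
```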

Regarding the $ {L}_{1} $ (LASSO) case, it works with the same idea.
The only difference is that we don't have a closed-form solution, hence deriving the connection between $ \lambda $ and $ t $ is trickier.

Have a look at my answers at StackExchange Cross Validated Q291962 and StackExchange Signal Processing Q21730 (Significance of $ \lambda $ in Basis Pursuit).
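For intuition only, here is a 1-D sketch (my own illustration, not taken from the linked answers) of the same $ \lambda \leftrightarrow t $ pairing for the scalar LASSO, where closed forms do exist (soft thresholding vs. clipping):

```python
import numpy as np

# Scalar LASSO: soft thresholding and clipping give the same solution
# once the penalty level lam is matched to the constraint level t.
def lasso_penalized(y, lam):
    """argmin_x 0.5 * (x - y)**2 + lam * |x|  (soft thresholding)."""
    return np.sign(y) * max(abs(y) - lam, 0.0)

def lasso_constrained(y, t):
    """argmin_x 0.5 * (x - y)**2 subject to |x| <= t  (clipping)."""
    return float(np.clip(y, -t, t))

y, t = 3.0, 1.2        # |y| > t, so the constraint is active
lam = abs(y) - t       # the matching penalty level
print(lasso_penalized(y, lam), lasso_constrained(y, t))  # both print 1.2
```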

Remark
What's actually happening?
In both problems, $ x $ tries to be as close as possible to $ y $.
In the first case, $ x = y $ makes the first term (the $ {L}_{2} $ distance) vanish, and in the second case it makes the whole objective vanish.
The difference is that in the first case one must balance the data term against the $ {L}_{2} $ norm of $ x $; as $ \lambda $ gets larger, the balance pushes $ x $ to be smaller.
In the second case there is a wall: you bring $ x $ closer and closer to $ y $ until you hit the wall, which is the constraint on its norm (set by $ t $).
If the wall is far enough away (a large value of $ t $, where "large enough" depends on the norm of $ y $), it has no effect, just as $ \lambda $ matters only once the penalty it imposes becomes comparable to the norm of $ y $.
The exact connection is given by the Lagrangian stated above.

Resources

I found this paper today (03/04/2019):