Solved – Lagrangian relaxation in the context of ridge regression


In "The Elements of Statistical Learning" (2nd ed), p63, the authors give the following two formulations of the ridge regression problem:

$$ \hat{\beta}^{ridge} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^N(y_i-\beta_0-\sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\} $$

and

$$ \hat{\beta}^{ridge} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^N(y_i-\beta_0-\sum_{j=1}^p x_{ij} \beta_j)^2 \text{, subject to } \sum_{j=1}^p \beta_j^2 \leq t.$$

It is claimed that the two are equivalent, and that there is a one-to-one correspondence between the parameters $\lambda$ and $t$.

It would appear that the first formulation is a Lagrangian relaxation of the second. However, I have never had an intuitive understanding of how or why Lagrangian relaxations work.

Is there a simple way to demonstrate that the two formulations are indeed equivalent? If I have to choose, I'd prefer intuition over rigour.

Thanks.

Best Answer

The correspondence can most easily be shown using the Envelope Theorem.

First, the full Lagrangian for the constrained problem includes an additional $-\lambda \cdot t$ term. Because this term does not depend on $\beta$, it does not affect the minimization over $\beta$ when $\lambda$ is treated as given, so Hastie et al. drop it.
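To make this concrete (using my own notation, not the book's), write the constraint as $\sum_{j=1}^p \beta_j^2 - t \leq 0$; the full Lagrangian is then

$$ \mathcal{L}(\beta, \lambda) = \sum_{i=1}^N \Big( y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \Big)^2 + \lambda \Big( \sum_{j=1}^p \beta_j^2 - t \Big), $$

which is exactly the penalized objective in the first formulation plus the constant $-\lambda t$.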

Now, if you differentiate the full Lagrangian with respect to $t$, the Envelope Theorem says you can ignore the indirect effects of $t$ through $\beta$, because you're at an optimum. What you'll be left with is the Lagrange multiplier coming from the $-\lambda \cdot t$ term.
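A minimal sketch of that calculation, holding $\lambda$ fixed and writing $\beta^*(t)$ for the constrained minimizer (again my notation):

$$ \frac{d}{dt}\,\mathcal{L}\big(\beta^*(t), \lambda\big) = \underbrace{\nabla_\beta \mathcal{L}\big|_{\beta^*(t)} \cdot \frac{d\beta^*}{dt}}_{=\,0 \text{ at the optimum}} + \frac{\partial \mathcal{L}}{\partial t} = -\lambda, $$

so the optimal value of the objective changes at rate $-\lambda$ as $t$ is increased.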

But what does this mean intuitively? Since the constraint binds at the optimum, the Lagrangian and the original objective coincide there, so the derivative of the Lagrangian with respect to $t$, evaluated at the optimum, is the same as the derivative of the original (constrained) objective value. Therefore the Lagrange multiplier gives the shadow price -- the value, in terms of the objective, of relaxing the constraint by increasing $t$.

I assume this is the correspondence Hastie et al. are referring to.
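For anyone who wants to see the correspondence numerically, here is a minimal sketch (my own code, not anything from the book), assuming centered data so the intercept can be dropped: for each $\lambda$, the penalized solution $\hat\beta(\lambda)$ determines $t = \sum_j \hat\beta_j(\lambda)^2$, and solving the constrained problem with that $t$ recovers the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

# Minimal sketch: centered data, no intercept (beta_0 dropped).
rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)
y = X @ rng.normal(size=p) + rng.normal(size=N)
y -= y.mean()

def rss(beta):
    # Residual sum of squares, the common objective of both formulations.
    r = y - X @ beta
    return r @ r

for lam in [0.1, 1.0, 10.0]:
    # Penalized (Lagrangian) form: closed-form ridge solution for this lambda.
    beta_pen = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    t = beta_pen @ beta_pen  # the value of t this lambda corresponds to

    # Constrained form: minimize RSS subject to ||beta||^2 <= t.
    con = NonlinearConstraint(lambda b: b @ b, -np.inf, t)
    res = minimize(rss, np.zeros(p), constraints=[con])

    print(f"lambda={lam:5.1f}  t={t:8.4f}  "
          f"max |beta_pen - beta_con| = {np.max(np.abs(beta_pen - res.x)):.2e}")
```

In each case the two solutions agree up to solver tolerance, and the printed $t$ shrinks as $\lambda$ grows, which is the one-to-one correspondence between the two parameters.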