Solved – Applying duality and KKT conditions to LASSO problem


I'm having some difficulty understanding how duality leads to the common form of the LASSO problem, and where the Karush-Kuhn-Tucker (KKT) condition known as complementary slackness comes in. I have two questions:

  1. We know that, given an optimization problem
    $$
    \begin{align*}
    &\min_x f(x)\\
    &\text{s.t.} \quad h_i(x) \leq 0 \, ,\quad i=1,\dots, m
    \end{align*}
    $$

solving this is equivalent to solving the dual problem
$$
\begin{align*}
&\max_\lambda \,\, g(\lambda) \\
&\text{s.t.} \quad \lambda \geq 0
\end{align*}
$$
with $g(\lambda) = \min_x \bigl\{f(x) + \sum_{i=1}^m \lambda_i h_i(x)\bigr\}$.

In the LASSO problem, the primal is
$$
\begin{align*}
&\min_\beta \,\, ||y-X\beta ||_2^2 \\
&\text{s.t.} \quad ||\beta ||_1 \leq t
\end{align*}
$$

so, if my understanding is correct, for the dual problem we should get
$$
g(\lambda) = \min_\beta ||y-X\beta ||_2^2 + \lambda( ||\beta ||_1 -t)
$$

However, the LASSO problem is always specified as
$$
\min_\beta ||y-X\beta ||_2^2 + \lambda ||\beta ||_1
$$

What am I missing? Is it because the $-\lambda t$ term is a constant in $\beta$, whose derivative is zero and which therefore drops out of the minimization?

  2. The second question: I have seen many authors present the solution to the LASSO problem by solving only the stationarity KKT condition
    $$ X^T(y-X\beta)=\lambda s $$

I understand that, since the problem is convex, the primal and dual feasibility conditions are satisfied, but I don't see why we don't also check the complementary slackness condition.

Best Answer

1) You're going in the wrong direction by invoking duality directly. To get from

$\text{arg min}_{\beta: \|\beta\|_1 \leq t} \|y - X\beta\|_2^2$

to

$\text{arg min}_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$

you just need to invoke Lagrange multipliers. (See, e.g. Section 5.1 of [1])

Lagrange multipliers are often taught alongside duality, but in practice you can switch directly from one formulation to the other without working through the dual problem. Note also that the $-\lambda t$ term in your $g(\lambda)$ is a constant in $\beta$, so it has no effect on the minimizer; that is exactly why the standard penalized form drops it.
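For concreteness, here is a small numerical sketch of that correspondence (my own illustration, not part of the original answer): solve the penalized form for some $\lambda$, set $t$ to the $\ell_1$ norm of that solution, and check that the constrained form returns the same coefficients. The data is synthetic and cvxpy is just one convenient solver choice.

```python
import numpy as np
import cvxpy as cp

# Synthetic data (illustrative only, not from the answer)
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ np.r_[2.0, -1.5, np.zeros(p - 2)] + 0.1 * rng.standard_normal(n)

lam = 5.0  # penalty level for the penalized (Lagrangian) form

# Penalized form: min ||y - X b||_2^2 + lam * ||b||_1
b_pen = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b_pen) + lam * cp.norm1(b_pen))).solve()

# Constrained form: min ||y - X b||_2^2  s.t.  ||b||_1 <= t,
# with t set to the l1 norm of the penalized solution
t = np.abs(b_pen.value).sum()
b_con = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b_con)),
           [cp.norm1(b_con) <= t]).solve()

print(np.max(np.abs(b_pen.value - b_con.value)))  # ~0, up to solver tolerance
```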

If you are interested in the dual problem of the lasso itself, it is worked out on Slides 12 and 13 of [2].
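In brief (my transcription, so double-check against the slides): for the penalized form $\min_\beta \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, the dual can be written as
$$
\begin{align*}
&\max_u \,\, \tfrac{1}{2}\|y\|_2^2 - \tfrac{1}{2}\|y - u\|_2^2 \\
&\text{s.t.} \quad \|X^T u\|_\infty \leq \lambda
\end{align*}
$$
where the dual variable $u$ plays the role of the fitted residual, i.e. $\hat{u} = y - X\hat{\beta}$ at the optimum.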

2) What you have probably seen is the KKT stationarity condition for the lasso:

$\hat{\beta} \in \text{arg min}_\beta \, \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \iff -X^T(y - X\hat{\beta}) + \lambda s = 0 \text{ for some } s \in \partial \|\hat{\beta}\|_1$

where $\partial \|\beta\|_1$ is called the subdifferential of the $\ell_1$ norm. (This is essentially just the standard "derivative equals zero at minimum" condition from calculus, but adjusted for non-differentiability.)
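As a quick sanity check (again my own sketch, not from the answer), you can verify this condition numerically on a fitted lasso. One caveat: scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so the stationarity condition becomes $X^T(y - X\hat{\beta})/n = \alpha s$.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
y = X @ np.r_[3.0, 0.0, -2.0, np.zeros(p - 3)] + 0.1 * rng.standard_normal(n)

alpha = 0.1
beta_hat = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10).fit(X, y).coef_

g = X.T @ (y - X @ beta_hat) / n   # X^T(y - X beta_hat) / n
on = beta_hat != 0

# On the support, g must equal alpha * sign(beta_hat); off it, |g| <= alpha.
print(np.allclose(g[on], alpha * np.sign(beta_hat[on]), atol=1e-4))
print(np.all(np.abs(g[~on]) <= alpha + 1e-4))
```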

We know the subdifferential of $|\beta_i|$ is $\{\text{sign}(\beta_i)\}$ if $\beta_i \neq 0$ (and the interval $[-1, 1]$ if $\beta_i = 0$), so this equation gives an exact closed-form solution for the lasso if we know the support and signs of the solution. Namely,

$\hat{\beta}_{\hat{S}} = (X_{\hat{S}}^TX_{\hat{S}})^{-1}(X_{\hat{S}}^Ty - \lambda \, \text{sign}(\hat{\beta}_{\hat{S}}))$

(Aside: this solution makes the "shrinkage" effect of the lasso, as compared to OLS, very clear.)

Of course, the hard part of solving the lasso is finding the support and signs of the solution, so this is not terribly helpful in practice.
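Still, given the support and signs from any lasso solver, you can plug them back into the formula above and check that it reproduces the fitted coefficients. Here is a minimal sketch of that check (my own, with synthetic data; with the objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, scikit-learn's `alpha` corresponds to $\lambda = n\alpha$):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.standard_normal((n, p))
y = X @ np.r_[1.5, -1.0, np.zeros(p - 2)] + 0.1 * rng.standard_normal(n)

alpha = 0.05
beta_hat = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10).fit(X, y).coef_
lam = n * alpha                     # rescale to match the (1/2)||.||^2 objective

S = np.flatnonzero(beta_hat)        # estimated support S-hat
X_S = X[:, S]
closed_form = np.linalg.solve(X_S.T @ X_S, X_S.T @ y - lam * np.sign(beta_hat[S]))

print(np.allclose(closed_form, beta_hat[S], atol=1e-4))  # matches the solver's solution
```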

It is, however, a very useful theoretical construct and can be used to prove lots of nice properties of the lasso; most importantly, it lets us use the "primal-dual witness" technique to establish conditions under which the lasso recovers the "true" set of variables. See Section 11.4 of [3].

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Available at https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

[2] http://www.stat.cmu.edu/~ryantibs/convexopt-F15/lectures/13-dual-corres.pdf

[3] T. Hastie, R. Tibshirani, M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Available at https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS_corrected_1.4.16.pdf
