Solved – LASSO: Deriving the smallest lambda at which all coefficients are zero

algebraic-statistics, lasso, machine learning, mathematical-statistics

I am trying to derive mathematically the smallest $\lambda$ at which all coefficients are zero, referring to Exercise 2.1 of the book Statistical Learning with Sparsity.
The result should be $\lambda_{min}=\max_j |X_j^Ty|$. I hope this isn't a duplicate, because it is also discussed here. However, my math skills are very limited, so I don't understand the solution there. I hope that someone can help me understand the mathematical details behind the derivation.
The LASSO is defined as:

$\min_\beta \frac{1}{2}||y-X\beta||_2^2 +\lambda||\beta||_1$

My approach: following the KKT conditions, we get

$X^T (y-X\hat{\beta})=\lambda s$

with $s$ as the subgradient.
$\beta=0$ leads to

$X^Ty=\lambda s \Leftrightarrow \frac{X^Ty}{s}=\lambda$

And here I am stuck. How can I get from there to $\lambda_{min}=\max_j |X_j^Ty|$?

EDIT: The exercise question in the book is:
"Show that the smallest value of $\lambda$ such that the regression coefficients
estimated by the lasso are all equal to zero is given by"

$\lambda_{max}=\max_j \left|\frac{1}{N}\langle x_j,y\rangle\right|$
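For what it's worth, here is a quick numerical sanity check of this form (my own sketch, not from the book; it assumes no intercept and unstandardized columns, and uses scikit-learn's Lasso, which minimizes $\frac{1}{2N}\Vert y - X\beta\Vert_2^2 + \alpha\Vert\beta\Vert_1$, so its alpha plays the role of $\lambda/N$ in the objective I wrote above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.standard_normal((N, p))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + rng.standard_normal(N)

# scikit-learn's Lasso minimizes (1/(2N))*||y - X b||_2^2 + alpha*||b||_1,
# i.e. the book's scaling, so the threshold is max_j |<x_j, y>| / N
lam_max = np.max(np.abs(X.T @ y)) / N

coef_at = Lasso(alpha=lam_max, fit_intercept=False).fit(X, y).coef_
coef_below = Lasso(alpha=0.99 * lam_max, fit_intercept=False).fit(X, y).coef_

print(np.allclose(coef_at, 0))   # True: all coefficients are zero at the threshold
print(np.any(coef_below != 0))   # True: a coefficient enters just below the threshold
```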

EDIT2: Source of the KKT condition: [image]

Best Answer

I think that the question is not so much a duplicate as a useful addition. The issue of finding the smallest $\lambda$ for which all coefficients are zero has occurred before on this website, but not explicitly as a question (which makes this question a useful reference point).


Something is wrong with your formulation using the KKT conditions

I do not know the Karush–Kuhn–Tucker (KKT) conditions for non-differentiable functions, but equation (9), $X^T(y-X\hat\beta) = \lambda s$, that you quote does not seem right; it is probably a typo or something like that. The left-hand side is the gradient of $\frac{1}{2}||y-X\beta||_2^2$, but it should be a directional derivative ($\nabla_\vec{s}$) instead (or something related).

If we explain it in other words (not using the KKT conditions), then you can see it as follows:

  • For finding a (local) minimum of the function below

    $$\mathcal{L}(\beta) = \frac{1}{2}||y-X\beta||_2^2 +\lambda||\beta||_1 $$

    you should have in all directions $\vec{s}$ the following condition

    $$\nabla_{\vec{s}} \mathcal{L}(\beta) \geq 0 $$

    in words: for a local minimum there is no direction $\vec{s}$ along which you can change $\beta$ to make the objective function $\mathcal{L}(\beta)$ decrease.

    (This is slightly different from the typical condition that the derivative equals zero. We use an inequality instead of an equality to zero because the function is not smooth.)

    This changes the question into finding the $\lambda$ for which

    $$\min_\vec{s} \left[ \nabla_{\vec{s}} \mathcal{L}(0) \right] = 0 $$

    or, since $\nabla_{\vec{s}} \mathcal{L}(0) = \lambda \Vert \vec{s} \Vert_1 - \left(X^Ty\right) \cdot \vec{s}$ when $\beta = 0$, this is equivalent to

    $$\max_\vec{s} \left[ \left(X^Ty\right) \cdot \vec{s} - \lambda \Vert \vec{s} \Vert_1\right] = 0$$

    after some shifting of terms

    $$\max_\vec{s} \left[ \left(X^Ty\right) \cdot \frac{\vec{s}}{\Vert \vec{s} \Vert_1} \right] = \lambda$$

    The maximum of a linear function over the unit $\ell_1$-ball is attained at a coordinate direction $\vec{s} = \pm e_j$, so this gives $\lambda_{max} = \max_j |X_j^Ty| = \Vert X^Ty \Vert_\infty$. (A small numerical check of this threshold follows below.)
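A quick numerical check of this threshold (my own sketch, not from the question or the book: it solves the lasso in the $\frac{1}{2}\Vert y-X\beta\Vert_2^2 + \lambda\Vert\beta\Vert_1$ form by proximal gradient descent with soft-thresholding; the helper names soft_threshold and lasso_ista are just illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t*||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    # proximal gradient (ISTA) for 1/2*||y - X b||_2^2 + lam*||b||_1
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft_threshold(b + X.T @ (y - X @ b) / L, lam / L)
    return b

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))
y = rng.standard_normal(50)

lam_max = np.max(np.abs(X.T @ y))        # = ||X^T y||_inf, the claimed threshold
print(np.allclose(lasso_ista(X, y, lam_max), 0))          # True: all coefficients zero
print(np.allclose(lasso_ista(X, y, 0.95 * lam_max), 0))   # False: a coefficient is active
```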


An intuitive geometrical interpretation with LARS regression

To explain the problem more intuitively, we are going to use a geometrical interpretation via the LARS algorithm (least angle regression). For $\beta$ sufficiently close to zero, LARS gives the same set of solutions as the LASSO.

With LARS, solutions of the optimization problem are found as a function of $\lambda$. These solutions are found by following a path along which the residual sum of squares decreases the most for a given change in the norm $\Vert \beta \Vert_1$.

Define $\lambda_{max}$ such that for $\lambda < \lambda_{max}$ we have at least one non-zero coefficient, and for $\lambda \geq \lambda_{max}$ all coefficients are zero.

For $\lambda>\lambda_{max}$ we have $\hat{\beta}^\lambda = 0$. The penalty term $\lambda \Vert \beta \Vert_1$ is too large: a change of $\beta$ away from zero (imagine gradient descent to find an optimum) does not reduce $\frac{1}{2} \Vert y-X \beta \Vert ^2_2$ sufficiently to get a reduction in $\frac{1}{2} \Vert y-X \beta \Vert ^2_2 + \lambda \Vert \beta \Vert_1$.

For $\lambda=\lambda_{max}$: the basic idea is that we can decrease $\lambda$ until we reach the point where a change of $\beta$ becomes possible such that an increase of $\Vert \beta \Vert_1$ causes the term $\frac{1}{2} \Vert y-X \beta \Vert ^2_2$ to decrease more than the term $\lambda \Vert \beta \Vert_1$ increases. To find this $\lambda_{max}$:

  • Determine the angle, or unit vector $\beta_0$, along which the sum of squares $\Vert y - X\beta \Vert_2^2$ would decrease most (for a given change of $\Vert \beta \Vert_1$), in the same way as the first step of the LARS algorithm explained in the very nice article by Efron et al., http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf. This relates to the variable(s) $x_i$ that correlate the most with $y$.
  • Calculate the rate of change at which the sum of squares of the error changes if you increase the length of the predicted vector $\hat{y}$

    $r_1= \frac{1}{2} \frac{\partial\sum{(y - X\beta_0\cdot s)^2}}{\partial s}$

    this is related to the angle between $\vec{\beta_0}$ and $\vec{y}$. The change of the square of the length of the error term $y_{err}$ will be equal to

    $$\frac{\partial y_{err}^2}{\partial \vert \vert \beta_0 \vert \vert _1} = 2 y_{err} \frac{\partial y_{err}}{\partial \vert \vert \beta_0 \vert \vert _1} = 2 \vert \vert y \vert \vert _2 \vert \vert X\beta_0 \vert \vert _2 cor(\beta_0,y)$$

    The term $\vert \vert X\beta_0 \vert \vert _2 cor(\beta_0,y)$ is how much the length $y_{err}$ changes as the coefficient $\beta_0$ changes.

  • Then $\lambda_{max}$ is equal to this rate of change: $\lambda_{max} = \frac{1}{2} \frac{\partial\sum{(y - X\beta_0\cdot s)^2}}{\partial s} = \frac{1}{2} \, 2 \Vert y \Vert _2 \Vert X\beta_0 \Vert _2 \operatorname{cor}(\beta_0,y) = \vert \langle X\beta_0 , y \rangle \vert $

    Where $\beta_0$ is the unit vector that corresponds to the angle in the first step of the LARS algorithm. (A small finite-difference check of this first step is sketched after this list.)
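Here is a tiny finite-difference sketch of that first step (my own illustration on made-up data; eps and the variable names are arbitrary). It picks the column with the largest absolute inner product with $y$ and checks that the initial decrease of $\frac{1}{2}\sum(y - X\beta)^2$ per unit of $\Vert \beta \Vert_1$ equals $\max_j |X_j^Ty|$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 4))
y = rng.standard_normal(60)

# the variable entering first is the one most correlated with y
j = np.argmax(np.abs(X.T @ y))

# move beta_j a tiny step in the direction that reduces the residual sum of squares
eps = 1e-6
step = np.zeros(X.shape[1])
step[j] = eps * np.sign(X[:, j] @ y)

def half_sse(b):
    # 1/2 * sum of squared errors
    return 0.5 * np.sum((y - X @ b) ** 2)

# decrease of (1/2)*SSE per unit increase of ||beta||_1, versus the claimed lambda_max
rate = (half_sse(np.zeros(X.shape[1])) - half_sse(step)) / eps
print(rate, np.max(np.abs(X.T @ y)))   # approximately equal
```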

If $k$ is the number of vectors that share the maximum correlation, and we measure the change of $\beta$ along the equiangular direction with the $\ell_2$ norm, then the rate of change picks up a factor $\sqrt{k}$ and we have

$\lambda_{max} = \sqrt{k} \, \Vert X^T y\Vert_\infty $

Measured against $\Vert \beta \Vert_1$, which is what the lasso actually penalizes, this factor cancels and we are back at $\lambda_{max} = \Vert X^T y\Vert_\infty = \max_j |X_j^Ty|$.

In other words, LARS gives you the initial rate of decrease of the SSE term (sum of squared errors) as you increase the length of the vector $\beta$ (the penalty term), and the ratio of these two is then equal to your $\lambda_{max}$.

The image below explains the concept how the LARS algorithm can help in finding $\lambda_{max}$.

  • In the image we see schematically the fitting of the vector $y$ by the vectors $x_1$ and $x_2$.
  • The dotted gray vectors depict the OLS (ordinary least squares) solution, with $y_{\perp}$ the part of $y$ (the error) that is perpendicular to the span of the vectors $x_1$ and $x_2$ (the perpendicular vector gives the shortest distance and hence the smallest sum of squares).
  • The gray lines are iso-lines along which $\Vert \beta \Vert_1$ is constant.
  • The vectors $\beta_0$ and $\beta_1$ depict the path that is being followed in a LARS regression.
  • The blue lines and green lines are the vectors that depict the error term, which decreases when the penalty term decreases and the solution 'moves' along the LARS path from $0$ to $\hat{\beta}_{OLS}$.

    The length of these lines corresponds to the SSE, and when they reach the perpendicular vector $y_{\perp}$ you have the least sum of squares.

Now, initially the path is followed along the vector $x_i$ that has the highest correlation with $y$. For this vector the change of $\sqrt{SSE}$ as a function of the change of $\Vert \beta \Vert$ is largest (namely equal to the correlation of that vector $x_i$ with $y$). While you follow the path, this change of $\sqrt{SSE}$ per unit change of $\Vert \beta \Vert$ decreases, until a path along a combination of two vectors becomes more 'efficient', and so on. Notice that the ratio of the change in $\sqrt{SSE}$ to the change in $\Vert \beta \Vert$ is a monotonically decreasing function. Therefore, the solution for $\lambda_{max}$ must be at the origin.
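To illustrate the path picture numerically (again a sketch of mine, assuming no intercept; scikit-learn's lasso_path uses the $\frac{1}{2N}$ scaling, so its alpha grid should start at $\max_j |X_j^Ty|/N$): at the start of the path all coefficients are zero, and the largest absolute correlation of the residual with the columns of $X$ is non-increasing along the path, matching the monotone decrease described above.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(3)
N, p = 80, 6
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# lasso_path minimizes (1/(2N))*||y - X b||_2^2 + alpha*||b||_1 over a decreasing alpha grid
alphas, coefs, _ = lasso_path(X, y)

print(np.isclose(alphas[0], np.max(np.abs(X.T @ y)) / N))   # path starts at alpha_max
print(np.allclose(coefs[:, 0], 0))                          # where all coefficients are zero

# the largest absolute correlation of the residual decreases along the path
residuals = y[:, None] - X @ coefs
max_cor = np.max(np.abs(X.T @ residuals), axis=0)
print(np.all(np.diff(max_cor) <= 1e-8))                     # non-increasing
```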

[Figure: graphical example of the LARS procedure]