Solved – How is the minimum $\lambda$ computed in group LASSO

lassomodelingrregularization

The LASSO problem works by minimizing

$$\min_\beta (\frac{1}{2}\left\rVert y-X\beta\right\rVert^2_2+\lambda\left\rVert\beta\right\rVert_1)$$

Here in this webpage I found that the minimal value of the $\lambda$ parameter that makes $0$ all the $\beta$ coefficients in the model is computed as $\lambda=\left\lVert X^ty\right\rVert_\infty$, so any value of $\lambda$ greater than this will send all the $\beta$ coefficients to zero.

On the other side, if our variables are grouped into $m$ groups of size $p_l$ the group LASSO works by solving:

$$\min_{\vec{\beta}}\left\{\frac{1}{2} \left\lVert\vec{y}-\sum_{l=1}^mX^{(l)}\vec{\beta^{(l)}}\right\rVert_2^2 +\lambda\sum_{l=1}^m\sqrt{p_l}\left\lVert\vec{\beta^{(l)}}\right\rVert_2\right\},$$

Is there a way to find the minimal value for the $\lambda$ parameter in the case of the group LASSO? Consider for instance the gglasso package and the BostonHousing dataset. I will consider a random grouping structure for the data:

library(mlbench, gglasso)
data("BostonHousing")
x <- data.matrix(BostonHousing[,-14])
y <- BostonHousing[,14]
index <- c(1,1,2,2,2,2,2,2,2,3,3,3,3)
fit = gglasso(x, y, loss="ls", nlambda=5, intercept=TRUE, group = index)
fit$lambda[1]
> 389.1944
fit$beta[,1]
> crim      zn   indus    chas     nox      rm     age     dis     rad     tax ptratio       b   lstat 
   0       0       0       0       0       0       0       0       0       0       0       0       0 

I would really appreciate any insight, article or book that shows how to find this minimal lambda, and that demonstrates why is it computed that way, not just for the group LASSO but also for the LASSO.

Best Answer

If a change of $\beta$ in any direction will not decrease the cost/objective function then you have found yourself in, at least, a local minimum. The calculations below will show for which λ the solution/point $\beta = 0$ stops to be a minimum.

Consider the effect of 'a change of $\beta^{(l)}$ by an infinitesimal distance $\partial l$' on the 'change of 1) the error/residual term and 2) the penalty term'.

$$\underbrace{ \frac{1}{2} \left\lVert\vec{y}-\sum_{l=1}^mX^{(l)}\vec{\beta^{(l)}}\right\rVert_2^2}_{\text{RSS term}} + \underbrace { \lambda\sum_{l=1}^m\sqrt{p_l}\left\lVert\vec{\beta^{(l)}}\right\rVert_2}_{\text{ penalty term}}$$

  • The penalty term will change by: $$\partial \left( \lambda \sqrt{p_l} \lVert\vec{\beta^{(l)}}\rVert_2 \right) = \left( \lambda \sqrt{p_l} \right) \partial l$$ (independent on the direction of change)
  • The error term will change by: $$\partial \frac{1}{2} RSS = \left( \lvert {X^{(l)}}^Ty \rvert \right) \partial l $$ where, if all beta are zero, then this term ${X^{(l)}}^Ty$ is the gradient of the RSS term. This gradient is the direction in which the directional derivative, change/minimization, will be greatest, and it's value is $\lvert {X^{(l)}}^Ty \rvert$ .

So per group the relative change of error term and penalty term (which needs to be greater than 1 or otherwise the overall cost term does not decrease) will be:

$$ \frac{ \lvert {X^{(l)}}^Ty \rvert } { \lambda \sqrt{p_l} } > 1$$

Thus

$$ \lambda_{min} = \underset{l}{max} \left(\frac{ \lvert {X^{(l)}}^Ty \rvert } { \sqrt{p_l} } \right) $$

which becomes $\lambda=\left\lVert X^ty\right\rVert_\infty$ if all group sizes $p_l$ are equal to one.

Note that, initially, a change of $\beta^{(l)}$ in two groups together is not beneficial. You could always improve the solution by shifting more weight to the group with a higher ratio for $\frac{ \lvert {X^{(l)}}^Ty \rvert } { \lambda \sqrt{p_l} }$.

This is most easily/intuitively seen in a geometrical viewpoint. The shape of iso-surface for beta is a polytope which makes contact with the iso-surface for the error which is a ellipsoid. They will initially contact in a point of the polytope (see for instance the graphical views in this answer or this answer ).


Instead of $\lambda_{min}$ it might be better to speak about the suppremum (the smallest upper bound).

We have for all $\lambda$ with non-zero $\beta$ that $\lambda < \lambda_{min}$

$$ \lambda < \underset{l}{max} \left(\frac{ \lvert {X^{(l)}}^Ty \rvert } { \sqrt{p_l} } \right) $$

Related Question