Solved – How is the minimum $\lambda$ computed in group LASSO

lassomodelingrregularization

The LASSO problem works by minimizing

$$\min_\beta (\frac{1}{2}\left\rVert y-X\beta\right\rVert^2_2+\lambda\left\rVert\beta\right\rVert_1)$$

Here in this webpage I found that the minimal value of the $\lambda$ parameter that makes $0$ all the $\beta$ coefficients in the model is computed as $\lambda=\left\lVert X^ty\right\rVert_\infty$, so any value of $\lambda$ greater than this will send all the $\beta$ coefficients to zero.

On the other side, if our variables are grouped into $m$ groups of size $p_l$ the group LASSO works by solving:

$$\min_{\vec{\beta}}\left\{\frac{1}{2} \left\lVert\vec{y}-\sum_{l=1}^mX^{(l)}\vec{\beta^{(l)}}\right\rVert_2^2 +\lambda\sum_{l=1}^m\sqrt{p_l}\left\lVert\vec{\beta^{(l)}}\right\rVert_2\right\},$$

Is there a way to find the minimal value for the $\lambda$ parameter in the case of the group LASSO? Consider for instance the gglasso package and the BostonHousing dataset. I will consider a random grouping structure for the data:

library(mlbench, gglasso)
data("BostonHousing")
x <- data.matrix(BostonHousing[,-14])
y <- BostonHousing[,14]
index <- c(1,1,2,2,2,2,2,2,2,3,3,3,3)
fit = gglasso(x, y, loss="ls", nlambda=5, intercept=TRUE, group = index)
fit$lambda[1]
> 389.1944
fit$beta[,1]
> crim      zn   indus    chas     nox      rm     age     dis     rad     tax ptratio       b   lstat 
   0       0       0       0       0       0       0       0       0       0       0       0       0

I would really appreciate any insight, article or book that shows how to find this minimal lambda, and that demonstrates why is it computed that way, not just for the group LASSO but also for the LASSO.

Best Answer

If a change of $\beta$ in any direction will not decrease the cost/objective function then you have found yourself in, at least, a local minimum. The calculations below will show for which λ the solution/point $\beta = 0$ stops to be a minimum.

Consider the effect of 'a change of $\beta^{(l)}$ by an infinitesimal distance $\partial l$' on the 'change of 1) the error/residual term and 2) the penalty term'.

$$\underbrace{ \frac{1}{2} \left\lVert\vec{y}-\sum_{l=1}^mX^{(l)}\vec{\beta^{(l)}}\right\rVert_2^2}_{\text{RSS term}} + \underbrace { \lambda\sum_{l=1}^m\sqrt{p_l}\left\lVert\vec{\beta^{(l)}}\right\rVert_2}_{\text{ penalty term}}$$

The penalty term will change by: $$\partial \left( \lambda \sqrt{p_l} \lVert\vec{\beta^{(l)}}\rVert_2 \right) = \left( \lambda \sqrt{p_l} \right) \partial l$$ (independent on the direction of change)
The error term will change by: $$\partial \frac{1}{2} RSS = \left( \lvert {X^{(l)}}^Ty \rvert \right) \partial l $$ where, if all beta are zero, then this term ${X^{(l)}}^Ty$ is the gradient of the RSS term. This gradient is the direction in which the directional derivative, change/minimization, will be greatest, and it's value is $\lvert {X^{(l)}}^Ty \rvert$ .

So per group the relative change of error term and penalty term (which needs to be greater than 1 or otherwise the overall cost term does not decrease) will be:

$$ \frac{ \lvert {X^{(l)}}^Ty \rvert } { \lambda \sqrt{p_l} } > 1$$

Thus

$$ \lambda_{min} = \underset{l}{max} \left(\frac{ \lvert {X^{(l)}}^Ty \rvert } { \sqrt{p_l} } \right) $$

which becomes $\lambda=\left\lVert X^ty\right\rVert_\infty$ if all group sizes $p_l$ are equal to one.

Note that, initially, a change of $\beta^{(l)}$ in two groups together is not beneficial. You could always improve the solution by shifting more weight to the group with a higher ratio for $\frac{ \lvert {X^{(l)}}^Ty \rvert } { \lambda \sqrt{p_l} }$.

This is most easily/intuitively seen in a geometrical viewpoint. The shape of iso-surface for beta is a polytope which makes contact with the iso-surface for the error which is a ellipsoid. They will initially contact in a point of the polytope (see for instance the graphical views in this answer or this answer ).

Instead of $\lambda_{min}$ it might be better to speak about the suppremum (the smallest upper bound).

We have for all $\lambda$ with non-zero $\beta$ that $\lambda < \lambda_{min}$

$$ \lambda < \underset{l}{max} \left(\frac{ \lvert {X^{(l)}}^Ty \rvert } { \sqrt{p_l} } \right) $$

Related Solutions

Solved – the smallest $\lambda$ that gives a 0 component in lasso

The lasso estimate described in the question is the lagrange multiplier equivalent of the following optimization problem:

$${\text{minimize } f(\beta) \text{ subject to } g(\beta) \leq t}$$

$$\begin{align} f(\beta) &= \frac{1}{2n} \vert\vert y-X\beta \vert\vert_2^2 \\ g(\beta) &= \vert\vert \beta \vert\vert_1 \end{align}$$

This optimizazion has a geometric representation of finding the point of contact between a multidimensional sphere and a polytope (spanned by the vectors of X). The surface of the polytope represents $g(\beta)$. The square of the radius of the sphere represents the function $f(\beta)$ and is minimized when the surfaces contact.

The images below provides a graphical explanation. The images made use of the following simple problem with vectors of length 3 (for simplicity in order to be able to make a drawing):

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3\\ \end{bmatrix} = \begin{bmatrix} 1.4 \\ 1.84 \\ 0.32\\ \end{bmatrix} = \beta_1 \begin{bmatrix} 0.8 \\ 0.6 \\ 0\\ \end{bmatrix} +\beta_2 \begin{bmatrix} 0 \\ 0.6 \\ 0.8\\ \end{bmatrix} +\beta_3 \begin{bmatrix} 0.6 \\ 0.64 \\ -0.48\\ \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3\\ \end{bmatrix} $$ and we minimize $\epsilon_1^2+\epsilon_2^2+\epsilon_3^2$ with the constraint $abs(\beta_1)+abs(\beta_2)+abs(\beta_3) \leq t$

The images show:

The red surface depicts the constraint, a polytope spanned by X.
And the green surface depicts the minimalized surface, a sphere.
The blue line shows the lasso path, the solutions that we find as we change $t$ or $\lambda$.
The green vector shows the OLS solution $\hat{y}$ (which was chosen as $\beta_1=\beta_2=\beta_3=1$ or $\hat{y} = x_1 + x_2 + x_3$.
The three black vectors are $x_1 = (0.8,0.6,0)$, $x_2 = (0,0.6,0.8)$ and $x_3 = (0.6,0.64,-0.48)$.

We show three images:

In the first image only a point of the polytope is touching the sphere. This image demonstrates very well why the lasso solution is not just a multiple of the OLS solution. The direction of the OLS solution adds stronger to the sum $\vert \beta \vert_1$. In this case only a single $\beta_i$ is non-zero.
In the second image a ridge of the polytope is touching the sphere (in higher dimensions we get higher dimensional analogues). In this case multiple $\beta_i$ are non-zero.
In the third image a facet tof the polytope is touching the sphere. In this case all the $\beta_i$ are nonzero.

The range of $t$ or $\lambda$ for which we have the first and third cases can be easily calculated due to their simple geometric representation.

##Case 1: Only a single $\beta_i$ non-zero##

The non-zero $\beta_i$ is the one for which the associated vector $x_i$ has the highest absolute value of the covariance with $\hat{y}$ (this is the point of the parrallelotope which closest to the OLS solution). We can calculate the Lagrange multiplier $\lambda_{max}$ below which we have at least a non-zero $\beta$ by taking the derivative with $\pm\beta_i$ (the sign depending on whether we increase the $\beta_i$ in negative or positive direction ):

$$\frac{\partial ( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2 - \lambda \vert \vert \beta \vert \vert_1 )}{\pm \partial \beta_i} = 0$$

which leads to

$$\lambda_{max} = \frac{ \left( \frac{1}{2n}\frac{\partial ( \vert \vert y - X\beta \vert \vert_2^2}{\pm \partial \beta_i} \right) }{ \left( \frac{ \vert \vert \beta \vert \vert_1 )}{\pm \partial \beta_i}\right)} = \pm \frac{\partial ( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2}{\partial \beta_i} = \pm \frac{1}{n} x_i \cdot y $$

which equals the $\vert \vert X^Ty \vert \vert_\infty$ mentioned in the comments.

where we should notice that this is only true for the special case in which the tip of the polytope is touching the sphere (so this is not a general solution, although generalization is straightforward).

##Case 3: All $\beta_i$ are non-zero.##

In this case that a facet of the polytope is touching the sphere. Then the direction of change of the lasso path is normal to the surface of the particular facet.

The polytope has many facets, with positive and negative contributions of the $x_i$. In the case of the last lasso step, when the lasso solution is close to the ols solution, then the contributions of the $x_i$ must be defined by the sign of the OLS solution. The normal of the facet can be defined by taking the gradient of the function $\vert \vert \beta(r) \vert \vert_1 $, the value of the sum of beta at the point $r$, which is:

$$ n = - \nabla_r ( \vert \vert \beta(r) \vert \vert_1) = -\nabla_r ( \text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^Tr ) = -\text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^T $$

and the equivalent change of beta for this direction is:

$$ \vec{\beta}_{last} = (X^TX)^{-1}X n = -(X^TX)^{-1}X^T [\text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^T]$$

which after some algebraic tricks with shifting the transposes ($A^TB^T = [BA]^T$) and distribution of brackets becomes

$$ \vec{\beta}_{last} = - (X^TX)^{-1} \text{sign} (\hat{\beta}) $$

we normalize this direction:

$$ \vec{\beta}_{last,normalized} = \frac{\vec{\beta}_{last}}{\sum \vec{\beta}_{last} \cdot sign(\hat{\beta})} $$

To find the $\lambda_{min}$ below which all coefficients are non-zero. We only have to calculate back from the OLS solution back to the point where one of the coefficients is zero,

$$ d = min \left( \frac{\hat{\beta}}{\vec{\beta}_{last,normalized}} \right)\qquad \text{with the condition that } \frac{\hat{\beta}}{\vec{\beta}_{last,normalized}} >0$$

,and at this point we evaluate the derivative (as before when we calculate $\lambda_{max}$). We use that for a quadratic function we have $q'(x) = 2 q(1) x$:

$$\lambda_{min} = \frac{d}{n} \vert \vert X \vec{\beta}_{last,normalized} \vert \vert_2^2 $$

##Images##

a point of the polytope is touching the sphere, a single $\beta_i$ is non-zero:

a ridge (or differen in multiple dimensions) of the polytope is touching the sphere, many $\beta_i$ are non-zero:

a facet of the polytope is touching the sphere, all $\beta_i$ are non-zero:

##Code example: ##

library(lars)    
data(diabetes)
y <- diabetes$y - mean(diabetes$y)
x <- diabetes$x

# models
lmc <- coef(lm(y~0+x))
modl <- lars(diabetes$x, diabetes$y, type="lasso")

# matrix equation
d_x <- matrix(rep(x[,1],9),length(x[,1])) %*% diag(sign(lmc[-c(1)]/lmc[1]))
x_c = x[,-1]-d_x
y_c = -x[,1]

# solving equation
cof <- coefficients(lm(y_c~0+x_c))
cof <- c(1-sum(cof*sign(lmc[-c(1)]/lmc[1])),cof)

# alternatively the last direction of change in coefficients is found by:
solve(t(x) %*% x) %*% sign(lmc)
  
# solution by lars package
cof_m <-(coefficients(modl)[13,]-coefficients(modl)[12,])

# last step
dist <- x %*% (cof/sum(cof*sign(lmc[])))
#dist_m <- x %*% (cof_m/sum(cof_m*sign(lmc[]))) #for comparison

# calculate back to zero
shrinking_set <- which(-lmc[]/cof>0)  #only the positive values
step_last <- min((-lmc/cof)[shrinking_set])

d_err_d_beta <- step_last*sum(dist^2)

# compare
modl[4] #all computed lambda
d_err_d_beta  # lambda last change
max(t(x) %*% y) # lambda first change
enter code here

note: those last three lines are the most important

> modl[4]            # all computed lambda by algorithm
$lambda
 [1] 949.435260 889.315991 452.900969 316.074053 130.130851  88.782430  68.965221  19.981255   5.477473   5.089179
[11]   2.182250   1.310435

> d_err_d_beta       # lambda last change by calculating only last step
    xhdl 
1.310435 
> max(t(x) %*% y)    # lambda first change by max(x^T y)
[1] 949.4353

(edit notice: this answer had been edited a lot. I have deleted old parts. But left this answer from July 14 which contains a nice explanation of the separation of $y$, $\epsilon$ and $\hat{y}$. The problems that occured in this initial answer have been solved. The final step of the LARS algorithm is easy to find, it is the normal of the polytope defined by the sign of the $\hat{\beta}$) #ANSWER JULI 14 2017#

for $\lambda < \lambda_{max}$ we have at least one non-zero coefficient (and above all are zero)

for $\lambda < \lambda_{min}$ we have all coefficients non-zero (and above at least one coefficient is zero)

finding $\lambda_{max}$

You can use the following steps to determine $\lambda_{max}$ (and this technique will also help for $\lambda_{min}$ although a bit more difficult). For $\lambda>\lambda_{max}$ we have $\hat{\beta}^\lambda$ = 0, as the penalty in the term $\lambda \vert \vert \beta \vert \vert_1$ will be too large, and an increase of $\beta$ does not reduce $(1/2n) \vert \vert y-X \beta \vert \vert ^2_2$ sufficiently to optimize the following expression

(1) \begin{equation}\frac{1}{2n} \vert \vert y-X \beta \vert \vert ^2_2 + \lambda \vert \vert \beta \vert \vert_1 \end{equation}

The basic idea is that at some level of $\lambda$ the increase of $\vert \vert \beta \vert \vert_1$ will cause the term $(1/2n) \vert \vert y-X \beta \vert \vert ^2_2$ to decreases more than the term $\lambda \vert \vert \beta \vert \vert_1$ increases. This point at which the terms are equal can be calculated exactly and at this point the expression (1) can be optimized by an increase of $\beta$ and this is the point below which some terms of $\beta$ are non-zero.

Determine the angle, or unit vector $\beta_0$, along which the sum $(y - X\beta)^2$ would decrease most. (in the same way as the first step in the LARS algorithm explained in the very nice article referenced by chRrr) This would relate to the variable(s) $x_i$ that correlates the most with $y$.
Calculate the rate of change at which the sum of squares of the error change if you increase the length of the predicted vector $\hat{y}$

$r_1= \frac{1}{2n} \frac{\partial\sum{(y - X\beta_0\cdot s)^2}}{\partial s}$

this is related to the angle between $\vec{\beta_0}$ and $\vec{y}$. The change of the square of the length of the error term $y_{err}$ will be equal to

$\frac{\partial y_{err}^2}{\partial \vert \vert \beta_0 \vert \vert _1} = 2 y_{err} \frac{\partial y_{err}}{\partial \vert \vert \beta_0 \vert \vert _1} = 2 \vert \vert y \vert \vert _2 \vert \vert X\beta_0 \vert \vert _2 cor(\beta_0,y)$

The term $\vert \vert X\beta_0 \vert \vert _2 cor(\beta_0,y)$ is how much the length $y_{err}$ changes as the coefficient $\beta_0$ changes and this includes a multiplication with $X$ (since the larger $X$ the larger the change as the coefficient changes) and a multiplication with the correlation (since only the projection part reduces the length of the error vector).

Then $\lambda_{max}$ is equal to this rate of change and $\lambda_{max} = \frac{1}{2n} \frac{\partial\sum{(y - X\beta_0\cdot s)^2}}{\partial s} = \frac{1}{2n} 2 \vert \vert y \vert \vert _2 \vert \vert x \vert \vert _2 corr(\beta_0,y) = \frac{1}{n} \vert \vert X\beta_0 \cdot y \vert \vert _2 $

Where $\beta_0$ is the unit vector that corresponds to the angle in the first step of the LARS algorithm.

If $k$ is the number of vectors that share the maximum correlation then we have

$\lambda_{max} = \frac{\sqrt{k}}{n} \vert \vert X^T y\vert \vert_\infty $

In other words. LARS gives you the initial decline of the SSE as you increase the length of the vector $\beta$ and this is then equal to your $\lambda_{max}$

##Graphical example##

The image below explains the concept how the LARS algorithm can help in finding $\lambda_{max}$. (and also hints how we can find $\lambda_{min}$)

In the image we see schematically the fitting of the vector $y$ by the vectors $x_1$ and $x_2$.
The dotted gray vectors depict the OLS solution with $y_{\perp}$ the part of $y$ that is perpendicular (the error) to the span of the vectors $x_1$ and $x_2$ (and the perpendicular vector is the shortest distance and the smallest sum of squares)
the gray lines are iso-lines for which $\vert \vert \beta \vert \vert_1$ is constant
The vectors $\beta_0$ and $\beta_1$ depict the path that is being followed in a LARS regression.
the blue lines and green lines are the vectors that depict the error term as the LARS regression algorithm is followed. the length of these lines is the SSE and when they reach the perpendicular vector $y_{\perp}$ you have the least sum of squares

Now, initially the path is followed along the vector $x_i$ that has the highest correlation with $y$. For this vector the change of the $\sqrt{SSE}$ as function of the change of $\vert \vert \beta \vert \vert$ is largest (namely equal to the correlation of that vector $x_i$ and $y$). While you follow the path this change of $\sqrt{SSE}$/$\vert \vert \beta \vert \vert$ decreases until a path along a combination of two vectors becomes more efficient, and so on. Notice that the ratio of the change $\sqrt{SSE}$ and change $\vert \vert \beta \vert \vert$ is a monotonously decreasing function. Therefore, the solution for $\lambda_{max}$ must be at the origin.

finding $\lambda_{min} $

The above illustration also shows how we can find $\lambda_{min}$ the point at which all of the coefficients are non zero. This occurs in the last step of the LARS procedure.

We can find this points with similar steps. We can determine the angle of the last step and the pre-last step and determine the point where these steps turn from the one into the other. This point is then used to calculate $\lambda_{min}$.

The expression is a bit more awkard since you do not calculate starting from zero. Actually, a solution may not exist. The vector along which the final step is made is in the direction of $X^T \cdot sign$ in which $sign$ is a vector with 1's and -1's depending on the direction of the change for the coefficient in the last step. This direction can be calculated if you know the result of the pre-last step (and could be achieved by itteration towards the first step but such a huge calculation does not seem to be what we want). I do not seem to find out if we can see directly what the sign is of the change of the coefficients in the last step.

Note note that we can more easily determine the point $\lambda_{opt}$ for which $\beta^\lambda$ is equal to the OLS solution. In the image this is related to the change of the vector $y_{\perp}$ as we move along the slope $\hat{\beta_n}$ (this might be more practical to calculate)

problems with sign changes

The LARS solution is close to the LASSO solution. A difference is in the number of steps. In the above method one would find out the direction of the last change and then go back from the OLS solution untill a coefficient becomes zero. A problem occurs if this point is not a point at which a coefficient is added to the set of active coefficients. It could be also a sign change and the LASSO and LARS solutions may differ in these points.

Solved – Why does Group Lasso use L2 norm for individual group penalties

Any $q > 1$ would define an estimator which performs group-wise selection and is the minimizer of a convex function. When $q=1$, the estimator reduces to a (weighted) lasso which does not perform group-wise selection. When $q < 1$, the objective function is non-convex.

In the original paper,

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.

the motivation is given (in figure 1) that the penalty looks like the lasso's penalty for coefficients in different groups while looking like the ridge regression's penalty for coefficients within the same group. This suggests that an explanation for why the group lasso uses $q=2$ reduces to an explanation for the utility of ridge regression over other $\ell_q$ penalized regression estimators. The standard intuition for this is that in ridge regression the coefficients are treated as being neutrally, in that they are neither being encouraged to be nearly sparse (as when $q < 2$) or to be nearly equal to each other (as when $q > 2$).

A more practical explanation for why $q=2$ could just be that the statistical community is comfortable with $\ell_2$ penalization and that the choice $q=2$ made possible deriving a LARS-type algorithm for fast computation. In the time when this paper was published, it was standard for new convex penalized regression estimators to be accompanied with a LARS-type algorithm.

Fig. 1.: