Solved – the smallest $\lambda$ that gives a 0 component in lasso


Define the lasso estimate $$\hat\beta^\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1,$$ where the $i^{th}$ row $x_i \in \mathbb{R}^p$ of the design matrix $X \in \mathbb{R}^{n \times p}$ is a vector of covariates for explaining the stochastic response $y_i$ (for $i = 1, \dots, n$).

We know that for $\lambda \geq \frac{1}{n} \|X^T y\|_\infty$, the lasso estimate is $\hat\beta^\lambda = 0$. (See, for instance, Lasso and Ridge tuning parameter scope.) In other notation, this says that $\lambda_\mathrm{max} = \frac{1}{n} \|X^T y\|_\infty$. Notice that $\lambda_\mathrm{max} = \sup\{\lambda : \hat\beta^\lambda \ne 0\}$. We can see this visually with the following image displaying the lasso solution path:

lasso solution path

Notice that on the far right hand side of the plot, all of the coefficients are zero. This happens at the point $\lambda_\mathrm{max}$ described above.
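For concreteness, here is a minimal R sketch of this $\lambda_\mathrm{max}$ formula, using the diabetes data that ships with the lars package (an arbitrary choice for illustration); $y$ is centered since the intercept is not penalized:

library(lars)
data(diabetes)

x <- diabetes$x
y <- diabetes$y - mean(diabetes$y)   # center y, since the intercept is not penalized

n <- nrow(x)
lambda_max <- max(abs(t(x) %*% y)) / n   # (1/n) * ||X^T y||_inf
lambda_max
# note: the lars package reports its path knots without the 1/n factor, so
# n * lambda_max should correspond to the first value of lars(x, y, type = "lasso")$lambda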

From this plot, we also notice that on the far left side, all of the coefficients are nonzero: what is the value of $\lambda$ at which any component of $\hat\beta^\lambda$ first becomes zero? That is, what is $$\lambda_\textrm{min} = \min_{\exists j \, \mathrm{ s.t. } \, \hat\beta^\lambda_j = 0} \lambda$$ equal to, as a function of $X$ and $y$? I'm interested in a closed form solution. In particular, I'm not interested in an algorithmic solution, such as, for instance, suggesting that LARS could find the knot through computation.

Despite my interests, it seems like $\lambda_\mathrm{min}$ may not be available in closed form, since, otherwise, lasso computational packages would likely take advantage of it when determining the grid of tuning parameters used during cross validation. In light of this, I'm interested in anything that can be theoretically shown about $\lambda_\mathrm{min}$ and (still) particularly interested in a closed form.

Best Answer

The lasso estimate described in the question is the Lagrange multiplier equivalent of the following optimization problem:

$${\text{minimize } f(\beta) \text{ subject to } g(\beta) \leq t}$$

$$\begin{align} f(\beta) &= \frac{1}{2n} \vert\vert y-X\beta \vert\vert_2^2 \\ g(\beta) &= \vert\vert \beta \vert\vert_1 \end{align}$$

This optimization can be represented geometrically as finding the point of contact between a multidimensional sphere and a polytope (spanned by the vectors of $X$). The surface of the polytope represents the constraint $g(\beta) = t$. The squared radius of the sphere represents the function $f(\beta)$ and is minimized when the surfaces touch.

The images below provide a graphical explanation. They use the following simple problem with vectors of length 3 (kept simple so that a drawing is possible):

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3\\ \end{bmatrix} = \begin{bmatrix} 1.4 \\ 1.84 \\ 0.32\\ \end{bmatrix} = \beta_1 \begin{bmatrix} 0.8 \\ 0.6 \\ 0\\ \end{bmatrix} +\beta_2 \begin{bmatrix} 0 \\ 0.6 \\ 0.8\\ \end{bmatrix} +\beta_3 \begin{bmatrix} 0.6 \\ 0.64 \\ -0.48\\ \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3\\ \end{bmatrix} $$ and we minimize $\epsilon_1^2+\epsilon_2^2+\epsilon_3^2$ with the constraint $|\beta_1|+|\beta_2|+|\beta_3| \leq t$

The images show:

  • The red surface depicts the constraint, a polytope spanned by X.
  • The green surface depicts the minimized surface, a sphere.
  • The blue line shows the lasso path, the solutions that we find as we change $t$ or $\lambda$.
  • The green vector shows the OLS solution $\hat{y}$ (which was chosen as $\beta_1=\beta_2=\beta_3=1$, i.e. $\hat{y} = x_1 + x_2 + x_3$).
  • The three black vectors are $x_1 = (0.8,0.6,0)$, $x_2 = (0,0.6,0.8)$ and $x_3 = (0.6,0.64,-0.48)$.

We show three images:

  1. In the first image only a point (vertex) of the polytope is touching the sphere. This image demonstrates very well why the lasso solution is not just a multiple of the OLS solution: the direction of the OLS solution adds more strongly to the sum $\vert \beta \vert_1$. In this case only a single $\beta_i$ is non-zero.
  2. In the second image a ridge of the polytope is touching the sphere (in higher dimensions we get higher-dimensional analogues). In this case multiple $\beta_i$ are non-zero.
  3. In the third image a facet of the polytope is touching the sphere. In this case all the $\beta_i$ are nonzero.

The range of $t$ or $\lambda$ for which we have the first and third cases can be easily calculated due to their simple geometric representation.
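As a companion to the two cases worked out below, here is a minimal R sketch (using the lars package) that traces the lasso path of the toy problem above, so the formulas can be checked numerically; the intercept and normalization flags are my assumption, chosen so the path matches the raw geometry of the drawings:

library(lars)

# toy design from the figures: y = x1 + x2 + x3 exactly (epsilon = 0),
# so the OLS solution is beta = (1, 1, 1)
X <- cbind(c(0.8, 0.6, 0),
           c(0, 0.6, 0.8),
           c(0.6, 0.64, -0.48))
y <- c(1.4, 1.84, 0.32)

fit <- lars(X, y, type = "lasso", intercept = FALSE, normalize = FALSE)
fit$lambda   # knots of the path (reported without the 1/n factor)
coef(fit)    # one row of coefficients per knot; the last row is the OLS solution (1, 1, 1)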

##Case 1: Only a single $\beta_i$ non-zero##

The non-zero $\beta_i$ is the one for which the associated vector $x_i$ has the highest absolute value of the covariance with $\hat{y}$ (this is the vertex of the parallelotope which is closest to the OLS solution). We can calculate the Lagrange multiplier $\lambda_{max}$ below which we have at least one non-zero $\beta$ by taking the derivative with respect to $\pm\beta_i$ (the sign depending on whether we increase $\beta_i$ in the negative or positive direction):

$$\frac{\partial ( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2 - \lambda \vert \vert \beta \vert \vert_1 )}{\pm \partial \beta_i} = 0$$

which leads to

$$\lambda_{max} = \frac{ \left( \dfrac{\partial \left( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2 \right)}{\pm \partial \beta_i} \right) }{ \left( \dfrac{\partial \vert \vert \beta \vert \vert_1 }{\pm \partial \beta_i}\right)} = \pm \frac{\partial \left( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2 \right)}{\partial \beta_i} = \pm \frac{1}{n} x_i \cdot y $$

Taking $x_i$ to be the column with the largest $\vert x_i \cdot y \vert$, this equals the $\frac{1}{n} \vert \vert X^Ty \vert \vert_\infty$ mentioned above.

Note that this is only true for the special case in which a tip of the polytope is touching the sphere (so it is not a general solution, although the generalization is straightforward).
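Continuing the toy sketch above (same X, y and fit), a quick numerical check of this Case 1 formula:

# lambda_max from the formula, in the question's 1/n scaling
n <- nrow(X)
lambda_max <- max(abs(t(X) %*% y)) / n
lambda_max * n   # should agree with fit$lambda[1], the first knot of the lars path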

##Case 3: All $\beta_i$ are non-zero.##

In this case a facet of the polytope is touching the sphere, and the direction of change of the lasso path is normal to the surface of that facet.

The polytope has many facets, with positive and negative contributions of the $x_i$. In the last lasso step, when the lasso solution is close to the OLS solution, the contributions of the $x_i$ are determined by the signs of the OLS solution. The normal of the facet can be found by taking the gradient of the function $\vert \vert \beta(r) \vert \vert_1$, the $\ell_1$ norm of $\beta$ at the point $r$, which is:

$$ \vec{n} = - \nabla_r ( \vert \vert \beta(r) \vert \vert_1) = -\nabla_r ( \text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^Tr ) = -\text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^T $$

and the corresponding change of $\beta$ in this direction is:

$$ \vec{\beta}_{last} = (X^TX)^{-1}X^T \vec{n}^{\,T} = -(X^TX)^{-1}X^T [\text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^T]^T$$

which, after some algebraic manipulation shifting the transposes ($A^TB^T = [BA]^T$) and expanding the brackets, becomes

$$ \vec{\beta}_{last} = - (X^TX)^{-1} \text{sign} (\hat{\beta}) $$

we normalize this direction:

$$ \vec{\beta}_{last,normalized} = \frac{\vec{\beta}_{last}}{\sum_i \vec{\beta}_{last,i} \, \text{sign}(\hat{\beta}_i)} $$

To find the $\lambda_{min}$ below which all coefficients are non-zero, we only have to calculate back from the OLS solution to the point where one of the coefficients becomes zero,

$$ d = \min \left( \frac{\hat{\beta}}{\vec{\beta}_{last,normalized}} \right)\qquad \text{with the condition that } \frac{\hat{\beta}}{\vec{\beta}_{last,normalized}} >0$$

and at this point we evaluate the derivative (as before, when we calculated $\lambda_{max}$), using that for a quadratic function $q(x) = q(1) \, x^2$ we have $q'(x) = 2 \, q(1) \, x$:

$$\lambda_{min} = \frac{d}{n} \vert \vert X \vec{\beta}_{last,normalized} \vert \vert_2^2 $$
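As a rough check of these Case 3 formulas (continuing the toy sketch above, same X, y and fit), the snippet below compares the direction $(X^TX)^{-1}\text{sign}(\hat{\beta})$ with the change of the coefficients over the final lars step; the two should be proportional. The value of $\lambda_{min}$ itself is computed for the diabetes data in the code example further down.

# direction of the final step: formula vs. lars path
beta_ols    <- drop(solve(t(X) %*% X, t(X) %*% y))    # OLS solution, here (1, 1, 1)
dir_formula <- drop(solve(t(X) %*% X) %*% sign(beta_ols))

cf       <- coef(fit)
dir_lars <- cf[nrow(cf), ] - cf[nrow(cf) - 1, ]       # change over the final step

dir_formula / dir_lars   # approximately constant ratio, i.e. the same direction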

##Images##

a point of the polytope is touching the sphere, a single $\beta_i$ is non-zero:

first step of lasso path

a ridge (or a higher-dimensional analogue) of the polytope is touching the sphere, several $\beta_i$ are non-zero:

middle of lasso path

a facet of the polytope is touching the sphere, all $\beta_i$ are non-zero:

final step of lasso path

##Code example##

library(lars)    
data(diabetes)
y <- diabetes$y - mean(diabetes$y)
x <- diabetes$x

# models
lmc <- coef(lm(y~0+x))
modl <- lars(diabetes$x, diabetes$y, type="lasso")

# matrix equation
d_x <- matrix(rep(x[,1],9),length(x[,1])) %*% diag(sign(lmc[-c(1)]/lmc[1]))
x_c = x[,-1]-d_x
y_c = -x[,1]

# solving equation
cof <- coefficients(lm(y_c~0+x_c))
cof <- c(1-sum(cof*sign(lmc[-c(1)]/lmc[1])),cof)

# alternatively the last direction of change in coefficients is found by:
solve(t(x) %*% x) %*% sign(lmc)
  
# solution by lars package
cof_m <-(coefficients(modl)[13,]-coefficients(modl)[12,])

# last step
dist <- x %*% (cof/sum(cof*sign(lmc[])))
#dist_m <- x %*% (cof_m/sum(cof_m*sign(lmc[]))) #for comparison

# calculate back to zero
shrinking_set <- which(-lmc[]/cof>0)  #only the positive values
step_last <- min((-lmc/cof)[shrinking_set])

d_err_d_beta <- step_last*sum(dist^2)

# compare
modl[4] #all computed lambda
d_err_d_beta  # lambda last change
max(t(x) %*% y) # lambda first change

Note: the last three lines are the most important.

> modl[4]            # all computed lambda by algorithm
$lambda
 [1] 949.435260 889.315991 452.900969 316.074053 130.130851  88.782430  68.965221  19.981255   5.477473   5.089179
[11]   2.182250   1.310435

> d_err_d_beta       # lambda last change by calculating only last step
    xhdl 
1.310435 
> max(t(x) %*% y)    # lambda first change by max(x^T y)
[1] 949.4353

(Edit notice: this answer has been edited a lot. I have deleted old parts but kept the answer from July 14 below, which contains a nice explanation of the separation of $y$, $\epsilon$ and $\hat{y}$. The problems that occurred in that initial answer have been solved: the final step of the LARS algorithm is easy to find, as it is the normal of the polytope facet defined by the signs of $\hat{\beta}$.)

##Answer July 14, 2017##

For $\lambda < \lambda_{max}$ we have at least one non-zero coefficient (and above it all coefficients are zero).

For $\lambda < \lambda_{min}$ we have all coefficients non-zero (and above it at least one coefficient is zero).

##Finding $\lambda_{max}$##

You can use the following steps to determine $\lambda_{max}$ (and this technique will also help for $\lambda_{min}$, although that is a bit more difficult). For $\lambda>\lambda_{max}$ we have $\hat{\beta}^\lambda = 0$, as the penalty in the term $\lambda \vert \vert \beta \vert \vert_1$ will be too large, and an increase of $\beta$ does not reduce $(1/2n) \vert \vert y-X \beta \vert \vert ^2_2$ sufficiently to improve the following expression:

$$\frac{1}{2n} \vert \vert y-X \beta \vert \vert ^2_2 + \lambda \vert \vert \beta \vert \vert_1 \tag{1}$$

The basic idea is that below some level of $\lambda$, an increase of $\vert \vert \beta \vert \vert_1$ makes the term $(1/2n) \vert \vert y-X \beta \vert \vert ^2_2$ decrease by more than the term $\lambda \vert \vert \beta \vert \vert_1$ increases. The point at which these two rates are equal can be calculated exactly; below it, expression (1) can be improved by increasing $\beta$, so below this point some components of $\beta$ are non-zero.

  • Determine the angle, or unit vector $\beta_0$, along which the sum $\sum (y - X\beta)^2$ would decrease the most (in the same way as the first step in the LARS algorithm explained in the very nice article referenced by chRrr). This relates to the variable(s) $x_i$ that correlate the most with $y$.
  • Calculate the rate at which the sum of squared errors changes as you increase the length of the predicted vector $\hat{y}$:

$r_1= \frac{1}{2n} \frac{\partial\sum{(y - X\beta_0\cdot s)^2}}{\partial s}$

This is related to the angle between $\vec{\beta_0}$ and $\vec{y}$. The change of the square of the length of the error term $y_{err}$ will be equal to

$\frac{\partial y_{err}^2}{\partial \vert \vert \beta_0 \vert \vert _1} = 2 y_{err} \frac{\partial y_{err}}{\partial \vert \vert \beta_0 \vert \vert _1} = 2 \vert \vert y \vert \vert _2 \vert \vert X\beta_0 \vert \vert _2 \, \text{cor}(\beta_0,y)$

The term $\vert \vert X\beta_0 \vert \vert _2 \, \text{cor}(\beta_0,y)$ is how much the length of $y_{err}$ changes as the coefficient $\beta_0$ changes: it includes a multiplication by $X$ (since the larger $X$, the larger the change as the coefficient changes) and a multiplication by the correlation (since only the projected part reduces the length of the error vector).


Then $\lambda_{max}$ is equal to this rate of change:

$$\lambda_{max} = \frac{1}{2n} \frac{\partial\sum{(y - X\beta_0\cdot s)^2}}{\partial s} = \frac{1}{2n} \, 2 \vert \vert y \vert \vert _2 \vert \vert X\beta_0 \vert \vert _2 \, \text{cor}(\beta_0,y) = \frac{1}{n} \, (X\beta_0) \cdot y $$

Where $\beta_0$ is the unit vector that corresponds to the angle in the first step of the LARS algorithm.

If $k$ is the number of vectors that share the maximum correlation then we have

$\lambda_{max} = \frac{\sqrt{k}}{n} \vert \vert X^T y\vert \vert_\infty $


In other words: LARS gives you the initial rate of decline of the SSE as you increase the length of the vector $\beta$, and this is then equal to your $\lambda_{max}$.
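A small numerical illustration of this idea (a sketch of my own, reusing the toy problem from the first part of this answer): the slope of $\frac{1}{2n} \vert \vert y - x_j s \vert \vert_2^2$ at $s = 0$, taken along the most correlated predictor $x_j$, has magnitude $\frac{1}{n} \vert \vert X^T y \vert \vert_\infty$, i.e. $\lambda_{max}$.

# numerical check of the initial rate of decline of the (1/2n-scaled) SSE
X <- cbind(c(0.8, 0.6, 0), c(0, 0.6, 0.8), c(0.6, 0.64, -0.48))
y <- c(1.4, 1.84, 0.32)
n <- nrow(X)

j   <- which.max(abs(t(X) %*% y))                 # most correlated predictor
sse <- function(s) sum((y - X[, j] * s)^2) / (2 * n)

eps <- 1e-6
(sse(eps) - sse(0)) / eps     # numerical slope at s = 0, approx -(x_j . y) / n
-max(abs(t(X) %*% y)) / n     # = -lambda_max in the question's scaling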


##Graphical example##

The image below explains how the LARS algorithm can help in finding $\lambda_{max}$ (and also hints at how we can find $\lambda_{min}$).

  • In the image we see schematically the fitting of the vector $y$ by the vectors $x_1$ and $x_2$.
  • The dotted gray vectors depict the OLS solution, with $y_{\perp}$ the part of $y$ that is perpendicular to the span of the vectors $x_1$ and $x_2$ (the error); the perpendicular vector gives the shortest distance and therefore the smallest sum of squares.
  • The gray lines are iso-lines along which $\vert \vert \beta \vert \vert_1$ is constant.
  • The vectors $\beta_0$ and $\beta_1$ depict the path that is followed in a LARS regression.
  • The blue and green lines are the vectors that depict the error term as the LARS regression algorithm proceeds; their length corresponds to the SSE, and when they reach the perpendicular vector $y_{\perp}$ you have the least sum of squares.

Now, initially the path is followed along the vector $x_i$ that has the highest correlation with $y$. For this vector the change of $\sqrt{SSE}$ as a function of the change of $\vert \vert \beta \vert \vert$ is largest (namely equal to the correlation of that vector $x_i$ with $y$). While you follow the path this change of $\sqrt{SSE}/\vert \vert \beta \vert \vert$ decreases until a path along a combination of two vectors becomes more efficient, and so on. Notice that the ratio of the change in $\sqrt{SSE}$ to the change in $\vert \vert \beta \vert \vert$ is a monotonically decreasing function. Therefore, the solution for $\lambda_{max}$ must be at the origin.

graphical example of LARS procedure

##Finding $\lambda_{min}$##

The above illustration also shows how we can find $\lambda_{min}$, the point below which all of the coefficients are non-zero. This occurs in the last step of the LARS procedure.

We can find this point with similar steps: determine the directions of the last step and the second-to-last step, and determine the point where the path turns from the one into the other. This point is then used to calculate $\lambda_{min}$.

The expression is a bit more awkward since you do not calculate starting from zero, and a solution may not exist. The vector along which the final step is made is in the direction of $X^T \cdot \text{sign}$, in which $\text{sign}$ is a vector of 1's and -1's depending on the direction of the change for each coefficient in the last step. This direction can be calculated if you know the result of the second-to-last step (and could be obtained by iterating forward from the first step, but such a large calculation does not seem to be what we want). I have not found out whether we can see directly what the sign of the change of the coefficients is in the last step.

Note that we can more easily determine the point $\lambda_{opt}$ for which $\beta^\lambda$ is equal to the OLS solution. In the image this is related to the change of the vector $y_{\perp}$ as we move along the slope $\hat{\beta}_n$ (this might be more practical to calculate).

##Problems with sign changes##

The LARS solution is close to the LASSO solution; a difference is in the number of steps. In the above method one would find the direction of the last change and then go back from the OLS solution until a coefficient becomes zero. A problem occurs if this point is not a point at which a coefficient is added to the set of active coefficients: it could also be a sign change, and the LASSO and LARS solutions may differ at these points.
