Define the lasso estimate $$\hat\beta^\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1,$$ where the $i^{th}$ row $x_i \in \mathbb{R}^p$ of the design matrix $X \in \mathbb{R}^{n \times p}$ is a vector of covariates for explaining the stochastic response $y_i$ (for $i=1, \dots, n$).
We know that for $\lambda \geq \frac{1}{n} \|X^T y\|_\infty$, the lasso estimate is $\hat\beta^\lambda = 0$. (See, for instance, Lasso and Ridge tuning parameter scope.) In other notation, this says that $\lambda_\mathrm{max} = \frac{1}{n} \|X^T y\|_\infty$. Notice that $\lambda_\mathrm{max} = \sup_{\hat\beta^\lambda \ne 0} \lambda.$ We can see this visually in the following image displaying the lasso solution path:
Notice that on the far right hand side of the plot, all of the coefficients are zero. This happens at the point $\lambda_\mathrm{max}$ described above.
From this plot, we also notice that on the far left side all of the coefficients are nonzero: what is the value of $\lambda$ at which any component of $\hat\beta^\lambda$ is initially zero? That is, what is $$\lambda_\textrm{min} = \min_{\exists j \, \mathrm{ s.t. } \, \hat\beta_j^\lambda = 0} \lambda$$ equal to, as a function of $X$ and $y$? I'm interested in a closed-form solution. In particular, I'm not interested in an algorithmic solution, such as, for instance, suggesting that LARS could find the knot through computation.
Despite my interests, it seems that $\lambda_\mathrm{min}$ may not be available in closed form, since otherwise lasso computational packages would likely take advantage of it when determining the tuning parameter depth during cross validation. In light of this, I'm interested in anything that can be shown theoretically about $\lambda_\mathrm{min}$, and (still) particularly interested in a closed form.
Best Answer
The lasso estimate described in the question is the Lagrange multiplier equivalent of the following optimization problem:
$${\text{minimize } f(\beta) \text{ subject to } g(\beta) \leq t}$$
$$\begin{align} f(\beta) &= \frac{1}{2n} \vert\vert y-X\beta \vert\vert_2^2 \\ g(\beta) &= \vert\vert \beta \vert\vert_1 \end{align}$$
This optimization has a geometric representation: finding the point of contact between a multidimensional sphere and a polytope (spanned by the vectors of $X$). The surface of the polytope represents $g(\beta)$. The square of the radius of the sphere represents the function $f(\beta)$ and is minimized when the surfaces make contact.
The images below provide a graphical explanation, using the following simple problem with vectors of length 3 (kept small so that a drawing is possible):
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3\\ \end{bmatrix} = \begin{bmatrix} 1.4 \\ 1.84 \\ 0.32\\ \end{bmatrix} = \beta_1 \begin{bmatrix} 0.8 \\ 0.6 \\ 0\\ \end{bmatrix} +\beta_2 \begin{bmatrix} 0 \\ 0.6 \\ 0.8\\ \end{bmatrix} +\beta_3 \begin{bmatrix} 0.6 \\ 0.64 \\ -0.48\\ \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3\\ \end{bmatrix} $$ and we minimize $\epsilon_1^2+\epsilon_2^2+\epsilon_3^2$ with the constraint $\vert\beta_1\vert+\vert\beta_2\vert+\vert\beta_3\vert \leq t$
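As a quick check (a numpy sketch; the variable names are mine), the response in this example is exactly the sum of the three covariate vectors, so the unconstrained least-squares solution is $\beta = (1, 1, 1)$:

```python
import numpy as np

# Columns of X are the three covariate vectors from the example
X = np.array([[0.8, 0.0, 0.6],
              [0.6, 0.6, 0.64],
              [0.0, 0.8, -0.48]])
beta = np.array([1.0, 1.0, 1.0])

# y = 1*x1 + 1*x2 + 1*x3 reproduces the response exactly
print(X @ beta)  # → [1.4, 1.84, 0.32]
```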
We show three images, corresponding to the cases where a vertex, a ridge, or a facet of the polytope touches the sphere:
The range of $t$ or $\lambda$ for which we have the first and third cases can be easily calculated due to their simple geometric representation.
##Case 1: Only a single $\beta_i$ non-zero##
The non-zero $\beta_i$ is the one for which the associated vector $x_i$ has the highest absolute covariance with $\hat{y}$ (this is the vertex of the parallelotope closest to the OLS solution). We can calculate the Lagrange multiplier $\lambda_{max}$, below which we have at least one non-zero $\beta_i$, by taking the derivative with respect to $\pm\beta_i$ (the sign depending on whether we increase $\beta_i$ in the negative or positive direction):
$$\frac{\partial ( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2 - \lambda \vert \vert \beta \vert \vert_1 )}{\pm \partial \beta_i} = 0$$
which leads to
$$\lambda_{max} = \frac{ \left( \dfrac{\partial \left( \frac{1}{2n} \vert \vert y - X\beta \vert \vert_2^2 \right)}{\pm \partial \beta_i} \right) }{ \left( \dfrac{\partial \vert \vert \beta \vert \vert_1 }{\pm \partial \beta_i}\right)} = \pm \frac{1}{2n} \frac{\partial \vert \vert y - X\beta \vert \vert_2^2 }{\partial \beta_i} \Bigg\vert_{\beta = 0} = \pm \frac{1}{n} x_i \cdot y $$
which, after taking the maximum over $i$, equals the $\frac{1}{n} \vert \vert X^Ty \vert \vert_\infty$ mentioned in the question.
We should notice that this is only true for the special case in which a vertex of the polytope is touching the sphere (so this is not a general solution, although generalization is straightforward).
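For the 3-vector example above, this threshold can be evaluated directly (a numpy sketch; the variable names are mine):

```python
import numpy as np

X = np.array([[0.8, 0.0, 0.6],
              [0.6, 0.6, 0.64],
              [0.0, 0.8, -0.48]])
y = np.array([1.4, 1.84, 0.32])
n = len(y)

# lambda_max = (1/n) * max_i |x_i . y|: at or above this value beta = 0
lam_max = np.max(np.abs(X.T @ y)) / n
print(lam_max)  # → 0.7413... (the first column attains the maximum)
```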
##Case 3: All $\beta_i$ are non-zero.##
In this case, a facet of the polytope touches the sphere. The direction of change of the lasso path is then normal to the surface of that facet.
The polytope has many facets, with positive and negative contributions of the $x_i$. In the case of the last lasso step, when the lasso solution is close to the OLS solution, the contributions of the $x_i$ must be determined by the sign of the OLS solution. The normal of the facet can be found by taking the gradient of the function $\vert \vert \beta(r) \vert \vert_1 $, the $\ell_1$ norm of $\beta$ at the point $r$, which is:
$$ n = - \nabla_r \vert \vert \beta(r) \vert \vert_1 = -\nabla_r \left( \text{sign} (\hat{\beta})^T (X^TX)^{-1}X^Tr \right) = -X (X^TX)^{-1} \, \text{sign} (\hat{\beta}) $$
and the equivalent change of beta for this direction is:
$$ \vec{\beta}_{last} = (X^TX)^{-1}X^T n = -(X^TX)^{-1}X^T X (X^TX)^{-1} \, \text{sign} (\hat{\beta})$$
which, after cancelling $X^TX$ against its inverse, becomes
$$ \vec{\beta}_{last} = - (X^TX)^{-1} \text{sign} (\hat{\beta}) $$
we normalize this direction:
$$ \vec{\beta}_{last,normalized} = \frac{\vec{\beta}_{last}}{\sum_i \vec{\beta}_{last,i} \cdot \text{sign}(\hat{\beta}_i)} $$
To find the $\lambda_{min}$ below which all coefficients are non-zero, we only have to calculate back from the OLS solution to the point where one of the coefficients becomes zero,
$$ d = \min_j \left( \frac{\hat{\beta}_j}{\vec{\beta}_{last,normalized,j}} \right)\qquad \text{with the condition that } \frac{\hat{\beta}_j}{\vec{\beta}_{last,normalized,j}} >0$$
and at this point we evaluate the derivative (as before, when we calculated $\lambda_{max}$). We use that for a quadratic function $q(x) = a x^2$ we have $q'(x) = 2 a x = 2\, q(1)\, x$:
$$\lambda_{min} = \frac{d}{n} \vert \vert X \vec{\beta}_{last,normalized} \vert \vert_2^2 $$
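Carrying out these steps numerically for the 3-vector example (a numpy sketch of the formulas above; the variable names are mine):

```python
import numpy as np

X = np.array([[0.8, 0.0, 0.6],
              [0.6, 0.6, 0.64],
              [0.0, 0.8, -0.48]])
y = np.array([1.4, 1.84, 0.32])
n = len(y)

# OLS solution and its signs
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
s = np.sign(beta_ols)

# direction of the last lasso segment: -(X^T X)^{-1} sign(beta_ols)
beta_last = -np.linalg.solve(X.T @ X, s)

# normalize so that ||beta||_1 changes at unit rate along this direction
beta_norm = beta_last / np.sum(beta_last * s)

# distance from the OLS solution to the first coefficient hitting zero
ratios = beta_ols / beta_norm
d = np.min(ratios[ratios > 0])

# lambda_min = (d/n) * ||X beta_norm||_2^2
lam_min = d / n * np.sum((X @ beta_norm) ** 2)
print(lam_min)  # → 0.1301... ; below this value all three coefficients are non-zero
```

(Here the third coefficient is the first to hit zero as $\lambda$ increases from the OLS solution.)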
##Images##
a point of the polytope is touching the sphere, a single $\beta_i$ is non-zero:
a ridge (or a higher-dimensional face, in more dimensions) of the polytope is touching the sphere, many $\beta_i$ are non-zero:
a facet of the polytope is touching the sphere, all $\beta_i$ are non-zero:
##Code example##
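To check both thresholds numerically, here is a small cyclic coordinate-descent lasso (a hypothetical helper written for this example, not an optimized solver) applied to the example data:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=2000):
    """Cyclic coordinate descent for (1/2n)||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]            # partial residual
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

X = np.array([[0.8, 0.0, 0.6],
              [0.6, 0.6, 0.64],
              [0.0, 0.8, -0.48]])
y = np.array([1.4, 1.84, 0.32])

# For this example lambda_max ≈ 0.7413 and lambda_min ≈ 0.1301
print(lasso_cd(X, y, 0.75))   # all zero: above lambda_max
print(lasso_cd(X, y, 0.15))   # third coefficient zero: between the two thresholds
print(lasso_cd(X, y, 0.12))   # all non-zero: below lambda_min
```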
(Edit notice: this answer has been edited a lot. I have deleted old parts but left the answer from July 14, which contains a nice explanation of the separation of $y$, $\epsilon$ and $\hat{y}$. The problems that occurred in that initial answer have been solved. The final step of the LARS algorithm is easy to find: it is the normal of the polytope defined by the sign of $\hat{\beta}$.)

#Answer July 14 2017#
for $\lambda < \lambda_{max}$ we have at least one non-zero coefficient (and above it all are zero)
for $\lambda < \lambda_{min}$ we have all coefficients non-zero (and above it at least one coefficient is zero)
##Finding $\lambda_{max}$##
You can use the following steps to determine $\lambda_{max}$ (and this technique will also help for $\lambda_{min}$, although that is a bit more difficult). For $\lambda>\lambda_{max}$ we have $\hat{\beta}^\lambda = 0$, as the penalty term $\lambda \vert \vert \beta \vert \vert_1$ will be too large, and an increase of $\beta$ does not reduce $(1/2n) \vert \vert y-X \beta \vert \vert ^2_2$ sufficiently to optimize the following expression
(1) \begin{equation}\frac{1}{2n} \vert \vert y-X \beta \vert \vert ^2_2 + \lambda \vert \vert \beta \vert \vert_1 \end{equation}
The basic idea is that below some level of $\lambda$, an increase of $\vert \vert \beta \vert \vert_1$ will cause the term $(1/2n) \vert \vert y-X \beta \vert \vert ^2_2$ to decrease more than the term $\lambda \vert \vert \beta \vert \vert_1$ increases. The point at which the two rates are equal can be calculated exactly; below this point, expression (1) can be improved by an increase of $\beta$, and this is the point below which some components of $\beta$ are non-zero.
Consider the rate of change of the squared error as we scale a candidate direction $\beta_0$ by a factor $s$:

$r_1= \frac{1}{2n} \frac{\partial\sum{(y - X\beta_0\cdot s)^2}}{\partial s}$
This is related to the angle between $X\vec{\beta_0}$ and $\vec{y}$. The change of the square of the length of the error term $y_{err}$ will be equal to
$\frac{\partial y_{err}^2}{\partial \vert \vert \beta_0 \vert \vert _1} = 2 y_{err} \frac{\partial y_{err}}{\partial \vert \vert \beta_0 \vert \vert _1} = 2 \vert \vert y \vert \vert _2 \vert \vert X\beta_0 \vert \vert _2 \,\text{cor}(X\beta_0,y)$
The term $\vert \vert X\beta_0 \vert \vert _2 \,\text{cor}(X\beta_0,y)$ expresses how much the length of $y_{err}$ changes as the coefficient $\beta_0$ changes; it includes a multiplication with $X$ (since the larger $X$, the larger the change as the coefficient changes) and a multiplication with the correlation (since only the projected part reduces the length of the error vector).
Then $\lambda_{max}$ is equal to this rate of change: $\lambda_{max} = \frac{1}{2n} \frac{\partial\sum{(y - X\beta_0\cdot s)^2}}{\partial s} = \frac{1}{2n}\, 2\, \vert \vert y \vert \vert _2 \,\vert \vert X\beta_0 \vert \vert _2 \,\text{cor}(X\beta_0,y) = \frac{1}{n} \vert (X\beta_0) \cdot y \vert $
Where $\beta_0$ is the unit vector that corresponds to the angle in the first step of the LARS algorithm.
If $k$ is the number of vectors that share the maximum correlation then we have
$\lambda_{max} = \frac{\sqrt{k}}{n} \vert \vert X^T y\vert \vert_\infty $
In other words, LARS gives you the initial rate of decline of the SSE as you increase the length of the vector $\beta$, and this is then equal to your $\lambda_{max}$.
##Graphical example##
The image below explains how the LARS algorithm can help in finding $\lambda_{max}$ (and also hints at how we can find $\lambda_{min}$).
Now, initially the path is followed along the vector $x_i$ that has the highest correlation with $y$. For this vector, the change of $\sqrt{SSE}$ as a function of the change of $\vert \vert \beta \vert \vert$ is largest (namely equal to the correlation of that vector $x_i$ and $y$). While you follow the path, this ratio decreases until a path along a combination of two vectors becomes more efficient, and so on. Notice that the ratio of the change of $\sqrt{SSE}$ to the change of $\vert \vert \beta \vert \vert$ is a monotonically decreasing function. Therefore, the solution for $\lambda_{max}$ must be at the origin.
##Finding $\lambda_{min}$##
The above illustration also shows how we can find $\lambda_{min}$, the point at which all of the coefficients are non-zero. This occurs in the last step of the LARS procedure.
We can find this point with similar steps: we determine the direction of the last step and of the pre-last step, and the point where the path turns from the one into the other. This point is then used to calculate $\lambda_{min}$.
The expression is a bit more awkward, since you do not calculate starting from zero. Actually, a solution may not exist. The vector along which the final step is made is in the direction of $X^T \cdot \text{sign}$, in which $\text{sign}$ is a vector of 1's and -1's depending on the direction of the change of each coefficient in the last step. This direction can be calculated if you know the result of the pre-last step (and could be reached by iterating from the first step onwards, but such a huge calculation does not seem to be what we want). I have not been able to determine whether we can see directly what the sign of the change of the coefficients is in the last step.
Note that we can more easily determine the point $\lambda_{opt}$ for which $\beta^\lambda$ is equal to the OLS solution. In the image this is related to the change of the vector $y_{\perp}$ as we move along the slope $\hat{\beta_n}$ (this might be more practical to calculate).
##Problems with sign changes##
The LARS solution is close to the LASSO solution; a difference is in the number of steps. In the above method, one would find the direction of the last change and then go back from the OLS solution until a coefficient becomes zero. A problem occurs if this point is not a point at which a coefficient is added to the set of active coefficients: it could also be a sign change, and the LASSO and LARS solutions may differ at these points.