Lasso – Lambda Value Correspondence in LARS Implementing LASSO


We know that a modified LARS can be used to implement LASSO. But what is the corresponding lambda, given that LARS has no lambda parameter?

I found reference here:

LASSO regularisation parameter from LARS algorithm

which said:

At each iteration $k$, the former algorithm finds an optimal couple $(\beta^*, \lambda^*)$ minimising the regularised loss function:
\begin{align}
(\beta^*, \lambda^*) = \operatorname{argmin}_{(\beta,\lambda)} L(\beta,\lambda) = \operatorname{argmin}_{(\beta,\lambda)} \Vert y-X\beta \Vert_2^2 + \lambda \Vert \beta \Vert_1
\end{align}

But wouldn't the $\lambda$ that minimises this loss function simply be $0$?

And another reference

LASSO: Deriving the smallest lambda at which all coefficients are zero

says the descent direction in LARS is the one that minimises the ratio of the change in the squared loss to the change in the $L_1$ norm: $$\dfrac{\nabla_{\vec{s}}\Vert y-X\beta\Vert_2^2}{\nabla_{\vec{s}}\Vert\beta\Vert_1}.$$
From this, I guess $\lambda = 1$ in LARS?

Edit

Here is my understanding of LARS:

  1. We move along the 'angular bisector' (the equiangular direction) of the features in the active set until some new feature forms an equally small angle with the residual.

  2. Then we add this feature to the active set. LARS stops after at most min(number of samples, number of features) steps, when no equiangular direction remains or the residual is orthogonal to the feature space.

  3. Summing each feature's step lengths over all steps gives the final solution $\beta^*.$

Regarding the entire solution path of pairs: does it mean that just before each new feature enters the active set, the current solution $\beta^*_t$ corresponds to a LASSO solution with some regularisation coefficient $\lambda^*_t$? That is, as the algorithm proceeds, the corresponding LASSO $\lambda^*_t$ gets smaller and smaller, and more and more coefficients become nonzero.
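As a numerical check of this picture, here is a minimal sketch using scikit-learn's `lars_path` on synthetic data (assuming scikit-learn is available). Note that sklearn scales the penalty as $\frac{1}{2n}\Vert y-X\beta\Vert_2^2 + \alpha\Vert\beta\Vert_1$, so its $\alpha$ corresponds to $\lambda/(2n)$ in the convention used above:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# method="lasso" runs the modified LARS that traces the lasso path
alphas, active, coefs = lars_path(X, y, method="lasso")

print(alphas)                    # penalties at the knots: strictly decreasing, ending at 0
print(list(active))              # order in which features enter the active set
print((coefs != 0).sum(axis=0))  # the support grows as the penalty shrinks
```

Running this shows exactly the conjectured behaviour: each knot of the path comes with a penalty value, the penalties decrease monotonically, and one more coefficient becomes nonzero at each knot.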

Best Answer

(Modified) LARS gives a sequence of coefficient estimates, call it $(\hat\beta_1,\hat\beta_2,\hat\beta_3,\dots,\hat\beta_k)$. Lasso gives a path $(\tilde\beta_\lambda,\lambda)$ containing the optimal $\tilde\beta_\lambda$ for any $\lambda$, or equivalently, a path $(t,\tilde\beta_t)$ where $\tilde\beta_t$ minimises the residual sum of squares subject to $\|\tilde\beta_t\|_1\leq t$.

The important claim about LARS and lasso is that

  • every $\hat\beta_i$ is also a $\tilde\beta_{\lambda_i}$ for some decreasing sequence of penalties $\lambda_i$ (equivalently, a $\tilde\beta_{t_i}$ for an increasing sequence $t_i$)
  • the path $(t,\tilde\beta_t)$ is linear between the values picked out by (modified) LARS; see the sketch below
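The second bullet can be checked numerically: pick a penalty strictly between two consecutive knots, fit the lasso there, and compare against the linear interpolation of the two knot solutions. A sketch, again with scikit-learn and the same caveat that its `alpha` is $\lambda/(2n)$ in the question's convention (`lars_path` reports its knots on that same scale):

```python
import numpy as np
from sklearn.linear_model import Lasso, lars_path

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.5]) + 0.3 * rng.standard_normal(n)

alphas, _, coefs = lars_path(X, y, method="lasso")

# a penalty strictly between the second and third knots
a_mid = 0.5 * (alphas[1] + alphas[2])
beta_mid = Lasso(alpha=a_mid, fit_intercept=False,
                 tol=1e-12, max_iter=100_000).fit(X, y).coef_

# linear interpolation of the two knot solutions in alpha
w = (alphas[1] - a_mid) / (alphas[1] - alphas[2])
beta_interp = (1 - w) * coefs[:, 1] + w * coefs[:, 2]

print(np.allclose(beta_mid, beta_interp, atol=1e-6))  # True: piecewise linear
```

This works because between knots the active set is fixed, and with a fixed active set the lasso solution is an affine function of the penalty.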

If you want to know which $\hat\beta_i$ corresponds to a given $t$, it's easy: $t=\|\hat\beta_i\|_1$. If you want to know which $\lambda$ corresponds to a given $\hat\beta_i$, you can work it out from the soft-thresholding property: the elements of $(X^TX\hat\beta_i-X^Ty)$ at the coordinates where $\hat\beta_i$ is nonzero are all equal to $\lambda/2$ in absolute value.
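Here is a quick numerical check of this recovery rule, using the same scikit-learn scaling as in the sketches above (so $\lambda = 2n\alpha$):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0]) \
    + 0.5 * rng.standard_normal(n)

alphas, _, coefs = lars_path(X, y, method="lasso")

for i in range(len(alphas)):
    beta = coefs[:, i]
    grad = X.T @ (X @ beta - y)   # = X^T X beta - X^T y
    nz = beta != 0
    lam = 2 * n * alphas[i]       # lambda for the loss ||y - Xb||^2 + lambda * ||b||_1
    # on the active set these entries all equal lambda/2 in absolute value
    print(np.allclose(np.abs(grad[nz]), lam / 2))
```

Every line prints `True`, confirming that each LARS knot is a lasso solution and that its $\lambda$ can be read off from the gradient on the active set.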