Solved – Choosing the range and grid density for regularization parameter in LASSO

lasso, regularization

I am currently studying the LASSO (least absolute shrinkage and selection operator). I see that the optimal value of the regularization parameter can be chosen by cross-validation, and that in ridge regression and many other methods that apply regularization, CV is likewise used to find the optimal regularization parameter (the penalty). My question is about the initial values for the upper and lower bounds of this parameter, and how to determine the length of the sequence.

To be specific, assume we have a LASSO problem
$$
\min_\beta \; (y - X\beta)'(y - X\beta) + \lambda \|\beta\|_1
$$
and we want to find the optimal value of the penalty $\lambda$. How can we choose lower and upper bounds $\lambda \in [a=?, b=?]$, and how many grid points should we place between them (i.e., a step of $\frac{b-a}{k=?}$)?

Best Answer

This methodology is described in the glmnet paper Regularization Paths for Generalized Linear Models via Coordinate Descent. Although the methodology there covers the general case of combined $L^1$ and $L^2$ regularization (the elastic net), it applies to the LASSO ($L^1$ only) as well.

The solution for the maximum $\lambda$ is given in section 2.5.

When $\tilde\beta = 0$, we see from (5) that $\tilde\beta_j$ will stay zero if $ \frac{1}{N} | \langle x_j , y \rangle | < \lambda \alpha $. Hence $N \alpha \lambda_{max} = \max_l | \langle x_l , y \rangle |$

That is, we observe that the update rule for beta forces all parameter estimates to zero for $\lambda > \lambda_{max}$ as determined above.
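As a concrete check of this formula, here is a minimal sketch in Python (the data and variable names are hypothetical; scikit-learn's `Lasso` stands in for glmnet, noting that its `alpha` parameter plays the role of $\lambda$ in the objective $\frac{1}{2N}\|y - X\beta\|^2 + \alpha\|\beta\|_1$):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data with a couple of true signals.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.5] + [0.0] * (p - 2)) + rng.normal(size=n)

# Center X and y so no intercept is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# lambda_max = max_l |<x_l, y>| / N  (alpha = 1, i.e. the pure LASSO case).
lam_max = np.max(np.abs(Xc.T @ yc)) / n

# Just above lambda_max every coefficient is shrunk exactly to zero;
# below it, at least one predictor enters the model.
coef_at_max = Lasso(alpha=lam_max * 1.001, fit_intercept=False).fit(Xc, yc).coef_
coef_below = Lasso(alpha=0.5 * lam_max, fit_intercept=False).fit(Xc, yc).coef_
```

The KKT conditions make this exact: zero is a solution precisely when $\frac{1}{N}|\langle x_j, y\rangle| \le \lambda$ for every $j$.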

The determination of $\lambda_{min}$ and the number of grid points is less principled. In glmnet they set $\lambda_{min} = 0.001\,\lambda_{max}$, and then choose a grid of $100$ points equally spaced on the logarithmic scale.

This works well in practice; in my extensive use of glmnet I have never found this grid to be too coarse.
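Building such a grid is a one-liner; a sketch, assuming a $\lambda_{max}$ already computed from the data as above:

```python
import numpy as np

# Hypothetical lambda_max; in practice use max_l |<x_l, y>| / N from the data.
lam_max = 1.7
lam_min = 0.001 * lam_max

# 100 points equally spaced on the log scale, from lam_max down to lam_min,
# mirroring the glmnet default path.
grid = np.geomspace(lam_max, lam_min, num=100)
```

Fitting along this path from largest to smallest $\lambda$ also lets each fit warm-start from the previous one, which is part of why the glmnet path is cheap to compute.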

In the LASSO-only ($L^1$) case things work better, as the LARS method provides a precise calculation of when the various predictors enter the model. True LARS does not do a grid search over $\lambda$; instead it produces an exact expression for the coefficient solution paths. Here is a detailed look at the exact calculation of the coefficient paths in the two-predictor case.
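To illustrate, scikit-learn exposes this exact-path computation via `lars_path` (the data below is hypothetical):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Hypothetical data with two true signals.
rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=n)

# method='lasso' returns the exact LASSO path: `alphas` are the breakpoints
# where the active set changes -- no grid over lambda is involved.
alphas, active, coefs = lars_path(X, y, method="lasso")
```

The first breakpoint (the largest `alpha`) is exactly the $\lambda_{max}$ above: all coefficients are zero there, and each subsequent breakpoint marks a predictor entering (or leaving) the model.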

The case of non-linear models (e.g., logistic, Poisson) is more difficult. At a high level, a quadratic approximation to the loss function is first obtained at the initial parameters $\beta = 0$, and then the calculation above is used to determine $\lambda_{max}$. A precise calculation of the parameter paths is not possible in these cases, even with only $L^1$ regularization, so a grid search is the only option.
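For the logistic case this works out simply, because at $\beta = 0$ (with a free intercept) the fitted probability is just the mean of $y$. A sketch, assuming hypothetical data and using scikit-learn's `LogisticRegression` (whose `C` relates to $\lambda$ via $\lambda = 1/(NC)$) as a stand-in check:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary data with two true signals.
rng = np.random.default_rng(2)
n, p = 300, 6
X = rng.normal(size=(n, p))
logits = X @ np.array([1.5, -1.0, 0.0, 0.0, 0.0, 0.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

# At beta = 0 with an intercept, the fitted probability is ybar, so the
# gradient of the average log-likelihood in x_j is <x_j, y - ybar> / N.
lam_max = np.max(np.abs(X.T @ (y - y.mean()))) / n

# sklearn minimizes C * sum(log-loss) + ||w||_1, i.e. lambda = 1 / (N * C).
# Just above lambda_max all coefficients should be exactly zero.
clf = LogisticRegression(penalty="l1", solver="saga", C=1.0 / (n * lam_max * 1.1),
                         fit_intercept=True, max_iter=5000, tol=1e-8).fit(X, y)
```

This only locates $\lambda_{max}$; unlike the squared-error case, the path between grid points still has to be traced numerically.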

Sample weights also complicate the situation: the inner products must be replaced in the appropriate places with weighted inner products.
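Concretely, for the $\lambda_{max}$ calculation this means weighted centering and a weighted inner product; a sketch, assuming (as in glmnet) that the weights are normalized to sum to $N$:

```python
import numpy as np

# Hypothetical data and observation weights.
rng = np.random.default_rng(3)
n, p = 150, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)
w = w * n / w.sum()  # normalize weights to sum to N

# Weighted means, then lambda_max with the weighted inner product:
# lambda_max = max_l |<x_l, w * y>| / N on weighted-centered data.
xbar = (w @ X) / n
ybar = (w @ y) / n
Xc, yc = X - xbar, y - ybar
lam_max_w = np.max(np.abs(Xc.T @ (w * yc))) / n
```

With uniform weights ($w_i = 1$) this reduces to the unweighted formula from the paper.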