LASSO – Determining Regularisation Parameter Using LARS Algorithm

Tags: lars, lasso, regression, regularization

In their seminal paper 'Least Angle Regression', Efron et al. describe a simple modification of the LARS algorithm that allows one to compute the full LASSO regularisation path.

I have implemented this variant successfully and usually plot the resulting path either against the number of steps (successive iterations of the LARS algorithm) or against the $l_1$-norm of the regression coefficients ($\Vert \beta \Vert_1$).

Yet, it seems like most of the packages available out there provide the regularisation path in terms of the LASSO penalisation coefficient $\lambda$ (e.g. LARS in R, where you can play with the 'mode' argument to switch between different representations).

My question is: what are the mechanics used to switch from one representation to the other(s)? I have seen various questions related to this (or, more specifically, to the issue of mapping the inequality constraint $\Vert \beta \Vert_1 \leq t$ to an appropriate penalisation term $\lambda \Vert \beta \Vert_1$), but have found no satisfying answer.


[Edit]

I have looked inside some MATLAB code that performs the required transformation and, for each LARS step $k$, this is how $\lambda$ seems to be computed:

$$ \lambda(k) = \max( 2 \vert X^T y \vert ),\ \ \ \text{for } k=1 $$
$$ \lambda(k) = \text{median}( 2 \vert X_{\mathcal{A}_k}^T r_{\mathcal{A}_k} \vert ),\ \ \ \forall k > 1$$

where $X$ (size $n \times p$) and $y$ (size $n \times 1$) denote the standardised inputs and response, $\mathcal{A}_k$ represents the set of active predictors at step $k$, and $r$ represents the current regression residual at step $k$.
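For concreteness, here is a minimal numpy sketch of how that rule could be evaluated at a given step. The function name and variables (`X`, `y`, `beta`, `active`) are my own, and I interpret $X_{\mathcal{A}_k}^T r_{\mathcal{A}_k}$ as the correlations of the active columns with the current residual:

```python
import numpy as np

def lambda_matlab_rule(X, y, beta, active):
    """Evaluate the lambda(k) rule described above for one LARS step.

    X      : (n, p) standardised design matrix
    y      : (n,) centred response
    beta   : (p,) current regression coefficients (all zero at step k = 1)
    active : indices of the active set A_k (empty at step k = 1)
    """
    if len(active) == 0:
        # Step k = 1: no active predictors yet.
        return np.max(2.0 * np.abs(X.T @ y))
    r = y - X @ beta                       # current residual
    corr = X[:, active].T @ r              # correlations of active predictors with r
    return np.median(2.0 * np.abs(corr))   # step k > 1
```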

I can't grasp the logic behind that calculation. Could someone help?

Best Answer

I have figured out how to perform the required conversion.

Assume that the inputs $X$ are standardised (zero mean, unit variance) and the responses $y$ are centred.

We know that the modified LARS algorithm provides the full LASSO regularisation path; cf. the original paper by Efron et al.

This means that, at each iteration $k$, the algorithm yields a pair $(\beta^*, \lambda^*)$ such that $\beta^*$ minimises the regularised loss function for the penalty $\lambda^*$: \begin{align} \beta^* &= \text{argmin}_{\beta}\, L(\beta,\lambda^*) \\ L(\beta,\lambda) &= \Vert y-X\beta \Vert_2^2 + \lambda \Vert \beta \Vert_1 \\ &= \sum_{i=1}^N \left(y_i - \sum_{j=1}^p \beta_j X_{ij}\right)^2 + \lambda \sum_{j=1}^p \vert \beta_j \vert \end{align}
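Note the convention used here: there is no $\frac{1}{2}$ or $\frac{1}{n}$ factor in front of the squared error, which is where the factor of $2$ in the formulas below comes from. As a quick sanity check, a minimal numpy sketch of this objective (the function name is my own):

```python
import numpy as np

def lasso_loss(X, y, beta, lam):
    """L(beta, lambda) = ||y - X beta||_2^2 + lambda * ||beta||_1
    (no 1/2 or 1/n factor in front of the squared error)."""
    residual = y - X @ beta
    return residual @ residual + lam * np.sum(np.abs(beta))
```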

For each active component $a \in \{1,...,q\}$ in the active set $\mathcal{A}_k$ at the end of step $k$, applying the KKT stationarity condition gives \begin{align} 0 &= \frac{\partial L}{\partial \beta_a}(\beta^*,\lambda^*) \\ &= -2 \sum_{i=1}^N X_{ia} \left(y_i - \sum_{j=1}^q \beta_j^* X_{ij}\right) + \lambda^*\ \text{sign}(\beta_a^*) \end{align}

In other words, $$ \lambda^* = 2 \frac{\sum_{i=1}^N X_{ia} \left(y_i - \sum_{j=1}^q \beta_j^* X_{ij}\right)}{\text{sign}(\beta_a^*)} $$ or, in matrix notation (noting that dividing by $\text{sign}(x)$ is the same as multiplying by it), the following equation is satisfied for any active component $a$: $$ \lambda^* = 2 \ \text{sign}(\beta_a^*) X_a^T r $$

In the original paper, the authors mention that for any solution to the LASSO problem, the sign of an active regression weight ($\beta_a^*$) must be identical to the sign of the corresponding active predictor's correlation with the current regression residual ($X_a^T r$), which is only logical since $\lambda^*$ must be positive. Thus we can also write:

$$ \lambda^* = 2 \vert X_a^T r \vert $$

In addition, we see that at the final step (the OLS fit, $\beta^* = (X^TX)^{-1}X^T y$), we get $\lambda^* = 0$ by the orthogonality lemma. The use of the median in the MATLAB implementation I found seems, IMHO, to be an effort to 'average out' numerical errors across the active components:

$$ \lambda^* = \text{median}( 2 \vert X_{\mathcal{A}_k}^T r_{\mathcal{A}_k} \vert ),\ \ \ \forall k > 1$$
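That interpretation is easy to check in code: for an exact solution, $2\,\text{sign}(\beta_a^*) X_a^T r$ is the same number for every active component $a$, so the median is just a robust way of picking it. A minimal numpy sketch (names are my own), assuming `beta` is an exact LASSO solution of the objective above:

```python
import numpy as np

def active_lambdas(X, y, beta, tol=1e-12):
    """Return 2 * sign(beta_a) * X_a^T r for every active component a.

    For an exact LASSO solution of ||y - X beta||_2^2 + lambda * ||beta||_1
    (X standardised, y centred), all entries coincide and equal lambda*;
    taking their median simply averages out numerical noise.
    """
    r = y - X @ beta
    active = np.flatnonzero(np.abs(beta) > tol)
    return 2.0 * np.sign(beta[active]) * (X[:, active].T @ r)
```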

To compute the value of $\lambda$ when there are no active components (step $k=1$), one can use the same trick as above but in the infinitesimal limit where all regression weights are zero and only the sign of the first component $b$ to become active (at step $k=2$) matters. This yields:

$$ \lambda^* = 2 \ \text{sign}(\beta_b^*) X_b^T y $$ which is strictly equivalent to

$$ \lambda^* = \max(2 \vert X^T y \vert), \text { for } k=1$$

because (i) the same remark as earlier applies concerning the sign of the regression weights, and (ii) the LARS algorithm selects as the next component $b$ to enter the active set the one most correlated with the current residual, which at step $k=1$ is simply $y$.
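In other words, the entry point of the path is just the largest doubled absolute correlation between a predictor and $y$; for any $\lambda$ above that value, $\beta = 0$ is optimal. A self-contained numpy sketch on toy data (all names and the toy model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5

# Toy standardised inputs and centred response.
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)
y = y - y.mean()

# Largest lambda on the path: for lambda >= this value, beta = 0 is optimal.
lambda_entry = np.max(2.0 * np.abs(X.T @ y))
print(lambda_entry)
```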