Solved – Lasso regression solutions


For ridge regression, $\hat{\beta}^{ridge}=\operatorname{argmin}\left \{ \frac{1}{2}\sum_{i=1}^{N}(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j)^2 +\lambda \sum_{j=1}^{p}\beta_j^2 \right \}$, the solutions can be computed in closed form from $\hat{\beta}^{ridge}=(X^TX+\lambda I)^{-1}X^Ty$.
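For concreteness, here is a minimal NumPy sketch of that ridge closed form, assuming the data have been centred so that the intercept $\beta_0$ can be dropped; the toy data and variable names are made up for illustration and are not part of the question.

```python
import numpy as np

# Hypothetical toy data (not from the question).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Centre X and y so the intercept can be dropped from the penalized fit.
X = X - X.mean(axis=0)
y = y - y.mean()

lam = 1.0
p = X.shape[1]

# Closed-form ridge solution: (X^T X + lambda * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```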

But for the lasso regression case: $\hat{\beta}^{lasso}=\operatorname{argmin}\left \{ \frac{1}{2}\sum_{i=1}^{N}(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j)^2 +\lambda \sum_{j=1}^{p}|\beta_j| \right \}$
With which matrix-form formula can I compute the lasso regression solutions? I looked in The Elements of Statistical Learning and couldn't find one; only the closed-form solution for the ridge coefficients is given.

Best Answer

With which matrix-form formula can I compute the lasso regression solutions?

As @Matthew Drury points out, there is no closed-form solution to the multivariate lasso problem. To understand why this is the case, you must first familiarise yourself with the closed-form solution to the univariate lasso problem, which is derived here:


Univariate Lasso problem

Computing the subdifferential of the lasso cost function with respect to $\theta_j$ and setting it to zero to find the minimum, where $\rho_j = \sum_{i} x_{ij}\big(y_i - \sum_{k \neq j} x_{ik}\theta_k\big)$ is the correlation of feature $j$ with the partial residual and $z_j = \sum_{i} x_{ij}^2$ is a normalizing constant:

\begin{aligned} \partial_{\theta_j} RSS^{lasso}(\theta) &= \partial_{\theta_j} RSS^{OLS}(\theta) + \partial_{\theta_j} \lambda || \theta ||_1 \\ 0 & \in -\rho_j + \theta_j z_j + \lambda \, \partial_{\theta_j} |\theta_j| \\ 0 & \in \begin{cases} -\rho_j + \theta_j z_j - \lambda & \text{if}\ \theta_j < 0 \\ [-\rho_j - \lambda ,-\rho_j + \lambda ] & \text{if}\ \theta_j = 0 \\ -\rho_j + \theta_j z_j + \lambda & \text{if}\ \theta_j > 0 \end{cases} \end{aligned}

For the second case, we must ensure that the closed interval contains zero so that $\theta_j = 0$ is a global minimum:

\begin{aligned} 0 \in [-\rho_j - \lambda ,-\rho_j + \lambda ] \end{aligned}

Solving for $\theta_j$ gives:

\begin{aligned} \begin{cases} \theta_j = \frac{\rho_j + \lambda}{z_j} & \text{for} \ \rho_j < - \lambda \\ \theta_j = 0 & \text{for} \ - \lambda \leq \rho_j \leq \lambda \\ \theta_j = \frac{\rho_j - \lambda}{z_j} & \text{for} \ \rho_j > \lambda \end{cases} \end{aligned}

We recognize this as the soft-thresholding function, scaled by the normalizing constant $1/z_j$.
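As a small illustration (not part of the original answer), the piecewise solution above can be written directly as a helper function; `soft_threshold` is a hypothetical name, and $\rho_j$, $z_j$ are assumed to be computed from the data as in the derivation.

```python
def soft_threshold(rho, lam):
    """Piecewise solution above: shrink rho toward zero by lam, or set it to zero."""
    if rho < -lam:
        return rho + lam
    elif rho > lam:
        return rho - lam
    return 0.0

# Univariate lasso solution for coordinate j, with z_j as the normalizing constant:
# theta_j = soft_threshold(rho_j, lam) / z_j
print(soft_threshold(3.0, 1.0), soft_threshold(0.5, 1.0), soft_threshold(-3.0, 1.0))
# -> 2.0 0.0 -2.0
```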

Multivariate Lasso problem

$$ \min_{\theta} \ \frac{1}{2} \ || \mathbf{y - X \theta}||_2^2 + \lambda || \theta||_1$$

Taking the subdifferential and requiring that zero is contained in it (i.e. the stationarity condition for a global minimum) gives:

$$ 0 \in -\mathbf{X^T(y - X \theta)} + \partial (\lambda || \theta||_1)$$

Re-arranging we get

$$ \mathbf{X^T(y - X\theta)} \in \lambda \, \partial || \theta ||_1$$

Here $\theta$ appears on both sides of the relation, so it cannot be isolated in general. You can only solve this explicitly if $X^TX = I$, i.e. the columns of $X$ are orthonormal, since then each component of $\theta$ decouples and you can use the soft-thresholding function to write an explicit solution for each component of $\theta$.
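In the general case, the practical way to use the univariate result is an iterative scheme rather than a formula. As a rough sketch (cyclical coordinate descent is not described in the original answer, but it simply applies the soft-thresholding update derived above to one coordinate at a time), assuming centred data and hypothetical names:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Cyclical coordinate descent: apply the univariate soft-thresholding
    update derived above to one coordinate at a time."""
    n, p = X.shape
    theta = np.zeros(p)
    z = (X ** 2).sum(axis=0)                  # z_j = sum_i x_ij^2
    for _ in range(n_iter):
        for j in range(p):
            # rho_j: correlation of feature j with the residual excluding feature j
            residual_without_j = y - X @ theta + X[:, j] * theta[j]
            rho_j = X[:, j] @ residual_without_j
            # Soft-thresholding cases from the univariate solution
            if rho_j < -lam:
                theta[j] = (rho_j + lam) / z[j]
            elif rho_j > lam:
                theta[j] = (rho_j - lam) / z[j]
            else:
                theta[j] = 0.0
    return theta
```

Note that when $X^TX = I$, every $z_j = 1$ and $\rho_j = x_j^Ty$ regardless of the other coordinates, so a single pass reduces to soft-thresholding the OLS coefficients $X^Ty$ componentwise; that is exactly the explicit orthonormal-case solution mentioned above.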