Solved – Coordinate descent for lasso

Tags: lasso, regression, regularization

I was reading about the coordinate descent procedure for lasso, which I have to write about. I understand how the procedure works, but what I don't get are the formulas: every source I look at adds some term or writes it slightly differently.

In these slides, page 14, the $\beta$ update is:

$\beta_i = \frac{S_\lambda}{\left \| X_i \right \|_2^2} \left ( \frac{X_i^T (y-X_{-i}\beta_{-i})}{X_i^T X_i} \right )=\frac{S_\lambda}{\left \| X_i \right \|_2^2} \left ( \frac{X_i^T (y-X_{-i}\beta_{-i})}{\left \| X_i \right \|_2^2} \right )$

What I don't really understand here is why the soft thresholding is being divided by $\left \| X_i \right \|_2^2$.
Should the coefficient update then be:

$\tilde{\beta}_i(\lambda)\leftarrow \frac{1}{\left \| X_i \right \|_2^2} S \left (\frac{X_i^T (y-X_{-i}\beta_{-i})}{\left \| X_i \right \|_2^2} ,\lambda\right )$ ?

I would very much appreciate any clarification, or any suggestion of what I could read to understand it better.

Best Answer

For the full derivation, have a look at this post.

To answer your question briefly:

What I don't really understand here is why the soft thresholding is being divided by $\|X_i\|_2^2$. Should the coefficient update then be...

This happens when the data you are working with has not been normalized before running coordinate descent. Going through the derivation, you find that the optimization problem is solved using subdifferentials and that the solution is of the form

$$\theta_j = \begin{cases} \dfrac{\rho_j + \lambda}{z_j} & \text{for} \ \rho_j < - \lambda \\ 0 & \text{for} \ - \lambda \leq \rho_j \leq \lambda \\ \dfrac{\rho_j - \lambda}{z_j} & \text{for} \ \rho_j > \lambda \end{cases}$$

We recognize this as the soft thresholding function $\frac{1}{z_j} S(\rho_j , \lambda)$, where $\frac{1}{z_j}$ is a normalizing constant with $z_j = \sum_{i=1}^m (x_j^{(i)})^2 = \|X_j\|_2^2$. When the data is normalized so that each column has unit norm, $z_j = 1$ and the factor disappears.
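
As a quick illustration, here is a minimal sketch of that operator in Python (the function and argument names are my own, not from the slides or the linked post):

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator S(rho, lambda), the three cases above."""
    if rho < -lam:
        return rho + lam   # rho_j < -lambda
    elif rho > lam:
        return rho - lam   # rho_j > lambda
    else:
        return 0.0         # -lambda <= rho_j <= lambda
```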

Coordinate descent update rule:

Repeat until convergence or max number of iterations:

  • For $j = 0, 1, \ldots, n$:
    • Compute $\rho_j = \sum_{i=1}^m x_j^{(i)} \left ( y^{(i)} - \sum_{k \neq j} \theta_k x_k^{(i)} \right )$
    • Compute $z_j = \sum_{i=1}^m (x_j^{(i)})^2$
    • Set $\theta_j = \frac{1}{z_j} S(\rho_j, \lambda)$
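
Putting the pieces together, here is a minimal sketch of this loop in Python, reusing the soft_threshold function from above (the name lasso_coordinate_descent is my own; for brevity the sketch penalizes every coefficient and omits intercept handling and a convergence check):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Lasso via coordinate descent, following the update rule above.

    X: (m, n) design matrix, y: (m,) target vector, lam: lambda penalty.
    Assumes soft_threshold() from the earlier snippet is in scope.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):      # repeat for a fixed number of sweeps
        for j in range(n):       # cycle through the coordinates
            # rho_j = X_j^T (y - prediction using all features except j)
            residual_minus_j = y - (X @ theta - X[:, j] * theta[j])
            rho_j = X[:, j] @ residual_minus_j
            z_j = np.sum(X[:, j] ** 2)   # z_j = ||X_j||_2^2
            theta[j] = soft_threshold(rho_j, lam) / z_j
    return theta
```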

Now, in practice you will find that most people and most books tell you to normalize the data before you perform lasso and coordinate descent. I will tell you that when I implemented lasso via coordinate descent, I could only make it work with normalized data. But that might be my own failing.
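
For completeness, a sketch of that normalization step using the hypothetical lasso_coordinate_descent above (the data and the choice of lam are made up for illustration):

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.5]) + rng.normal(size=100)

# Scale each column to unit L2 norm so that z_j = 1 in the update.
norms = np.linalg.norm(X, axis=0)
theta = lasso_coordinate_descent(X / norms, y, lam=1.0)
print(theta / norms)  # map the coefficients back to the original feature scale
```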