Lasso – Step-by-Step Derivation of Group Lasso

derivative, lasso, regularization

I've been reading the book Statistical Learning with Sparsity and I just came across the Group Lasso section.

I can follow the maths up to the final derivation of the Group Lasso solutions when the ${\bf Z}_j$ are not orthonormal. However, when they are orthonormal, the solutions are said to be simpler and have the form

$$\hat\beta_j = \left( 1 - \dfrac{\lambda}{\|{\bf Z}_j^T{\bf r}_j\|_2} \right)_+ {\bf Z}_j^T{\bf r}_j$$

The book uses the notation ${\bf Z}_j$ for each of the group features instead of ${\bf X}_j$.

I've seen the derivation of this equation in About the derivation of group Lasso, but I don't get why $\|\beta_j\|_2 = \|S_j\|_2 - \lambda$ in that question's notation (I omit the factor $\sqrt{p_j}$ since Hastie doesn't use this penalty weight, but it is conceptually the same). Taking that equality as given I can reach the final result, but I don't understand where it comes from or the reasoning behind it.

I therefore get stuck at $$\beta_j=\left( 1+\frac{\lambda}{\|\beta_j\|_2} \right)^{-1} S_j$$

Best Answer

It took me some time to understand this derivation. As usual, once you get the trick, it's actually straightforward.

To solve the group LASSO via block coordinate descent, we solve for each group of variables $\beta_j = (\beta_{j1}, \ldots, \beta_{jp_j})^\top$ separately, i.e. we need to solve $$ \arg\min_{\beta_j} \frac{1}{2} (\mathbf{Y} - \mathbf{X} \mathbf{\beta})^\top (\mathbf{Y} - \mathbf{X} \mathbf{\beta}) + \lambda \sum_{k=1}^J d_k \lVert \beta_k \rVert , \\ \Leftrightarrow \arg\min_{\beta_j} \frac{1}{2} (\mathbf{Y} - \mathbf{X}^{-j} \mathbf{\beta}_{-j} - \mathbf{X}^{j} \beta_j)^\top (\mathbf{Y} - \mathbf{X}^{-j} \mathbf{\beta}_{-j} - \mathbf{X}^{j} \beta_j) + \lambda \sum_{k=1}^J d_k \lVert \beta_k \rVert , \\ \Leftrightarrow \arg\min_{\beta_j} \frac{1}{2} (\mathbf{r}_j - \mathbf{X}^{j} \beta_j)^\top (\mathbf{r}_j - \mathbf{X}^{j} \beta_j) + \lambda \sum_{k=1}^J d_k \lVert \beta_k \rVert , \\ $$ where $\mathbf{X}^{j}$ denotes the columns of $\mathbf{X}$ corresponding to the $j$-th group (with $(\mathbf{X}^{j})^\top \mathbf{X}^{j} = \mathbf{I}$), $\mathbf{X}^{-j}$ denotes the matrix $\mathbf{X}$ without the columns corresponding to the $j$-th group, $\beta_{-j}$ is the corresponding vector of coefficients, and $\mathbf{r}_j = \mathbf{Y} - \mathbf{X}^{-j} \mathbf{\beta}_{-j}$ is the partial residual obtained after removing the fit of all other groups. Only the $k = j$ term of the penalty depends on $\beta_j$; the remaining terms are constants in this subproblem.
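For concreteness, here is a minimal numpy sketch of how these per-group quantities could be formed. This is my own illustration (not from the book or the original answer); the names `X`, `Y`, `beta`, and `groups` are hypothetical, and each block is orthonormalized with a QR factorization so that $(\mathbf{X}^{j})^\top \mathbf{X}^{j} = \mathbf{I}$ holds as assumed above (playing the role of the book's ${\bf Z}_j$ blocks).

```python
import numpy as np

def orthonormalize(Xj):
    # QR gives an orthonormal basis Q for the column space of X^j,
    # so Q^T Q = I as required by the derivation above.
    Q, _ = np.linalg.qr(Xj)
    return Q

def partial_residual_and_score(X, Y, beta, groups, j):
    """Partial residual r_j = Y - X^{-j} beta_{-j} and score s_j = (X^j)^T r_j.

    `groups` is a list of column-index arrays; `beta` holds the current coefficients.
    Assumes the columns in X[:, groups[j]] are already orthonormal.
    """
    Xj = X[:, groups[j]]
    # fit of all groups except the j-th one
    fit_others = sum(X[:, idx] @ beta[idx] for k, idx in enumerate(groups) if k != j)
    r_j = Y - fit_others
    s_j = Xj.T @ r_j
    return r_j, s_j
```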

Case I

Considering the case $\beta_j \neq \mathbf{0}$ and setting the derivative with respect to $\beta_j$ to zero, we obtain $$ -(\mathbf{X}^{j})^\top (\mathbf{r}_j - \mathbf{X}^{j} \beta_j) + \lambda d_j \frac{\beta_j}{\lVert \beta_j \rVert} = 0 , \\ \Leftrightarrow -(\mathbf{X}^{j})^\top \mathbf{r}_j + (\mathbf{X}^{j})^\top \mathbf{X}^{j} \beta_j + \lambda d_j \frac{\beta_j}{\lVert \beta_j \rVert} = 0 , \\ \Leftrightarrow -(\mathbf{X}^{j})^\top \mathbf{r}_j + \mathbf{I} \beta_j + \frac{\lambda d_j}{\lVert \beta_j \rVert}\beta_j = 0 , \\ \Leftrightarrow \left( 1 + \frac{\lambda d_j}{\lVert \beta_j \rVert} \right) \beta_j = (\mathbf{X}^{j})^\top \mathbf{r}_j , \\ \Leftrightarrow \beta_j = \left( 1 + \frac{\lambda d_j}{\lVert \beta_j \rVert} \right)^{-1} (\mathbf{X}^{j})^\top \mathbf{r}_j . $$

We still have $\beta_j$ on both sides, namely in the form of $\lVert \beta_j \rVert$, which we would like to eliminate. Letting $\mathbf{s}_j = (\mathbf{X}^{j})^\top \mathbf{r}_j$ and taking the norm of the last equation, we can express $\lVert \beta_j \rVert$ as $$ \lVert \beta_j \rVert = \left\Vert \left( 1 + \frac{\lambda d_j}{\lVert \beta_j \rVert} \right)^{-1} \mathbf{s}_j \right\Vert \\ = \sqrt{ \sum_{i=1}^{p_j} \left( 1 + \frac{\lambda d_j}{\lVert \beta_j \rVert} \right)^{-2} s_{ji}^2 } \\ = \sqrt{ \left( 1 + \frac{\lambda d_j}{\lVert \beta_j \rVert} \right)^{-2} \sum_{i=1}^{p_j} s_{ji}^2 } \\ = \left( 1 + \frac{\lambda d_j}{\lVert \beta_j \rVert} \right)^{-1} \lVert \mathbf{s}_j \rVert . $$ Now, solving for $\lVert \beta_j \rVert$, we get $$ \lVert \beta_j \rVert \left( 1 + \frac{\lambda d_j}{\lVert \beta_j \rVert} \right) = \lVert \mathbf{s}_j \rVert, \\ \Leftrightarrow \lVert \beta_j \rVert + \lambda d_j = \lVert \mathbf{s}_j \rVert, \\ \Leftrightarrow \lVert \beta_j \rVert = \lVert \mathbf{s}_j \rVert - \lambda d_j . $$ Note that this is consistent with $\lVert \beta_j \rVert > 0$ only if $\lVert \mathbf{s}_j \rVert > \lambda d_j$, so Case I applies exactly when the score is large enough. We can now substitute this expression for $\lVert \beta_j \rVert$ into the equation above: $$ \beta_j = \left( 1 + \frac{\lambda d_j}{\lVert \beta_j \rVert} \right)^{-1} \mathbf{s}_j \\ = \left( 1 + \frac{\lambda d_j}{\lVert \mathbf{s}_j \rVert - \lambda d_j} \right)^{-1} \mathbf{s}_j \\ = \left( \frac{\lVert \mathbf{s}_j \rVert}{\lVert \mathbf{s}_j \rVert - \lambda d_j} \right)^{-1} \mathbf{s}_j \\ = \left(1 - \frac{\lambda d_j}{\lVert \mathbf{s}_j \rVert} \right) \mathbf{s}_j . $$
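As a sanity check (my own, not part of the original answer), the closed form can be compared against a generic numerical minimizer of the per-group objective $\frac{1}{2}\lVert \mathbf{r}_j - \mathbf{X}^{j}\beta_j \rVert^2 + \lambda d_j \lVert \beta_j \rVert$ on a random instance where $\lVert \mathbf{s}_j \rVert > \lambda d_j$; the sizes, seed, and $\lambda$ below are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p_j, lam, d_j = 50, 4, 0.5, 1.0

Xj, _ = np.linalg.qr(rng.normal(size=(n, p_j)))   # orthonormal group block
r_j = rng.normal(size=n)                          # stand-in partial residual
s_j = Xj.T @ r_j
assert np.linalg.norm(s_j) > lam * d_j            # Case I applies

def objective(b):
    # per-group objective: 0.5 * ||r_j - X^j b||^2 + lam * d_j * ||b||
    return 0.5 * np.sum((r_j - Xj @ b) ** 2) + lam * d_j * np.linalg.norm(b)

closed_form = (1.0 - lam * d_j / np.linalg.norm(s_j)) * s_j   # Case I formula
numeric = minimize(objective, x0=s_j).x                       # generic minimizer

print(np.max(np.abs(closed_form - numeric)))      # should be close to zero
```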

Case II

The vector norm is non-differentiable at $\mathbf{0}$, therefore we need to follow a different path if $\beta_j = \mathbf{0}$. We consider the subdifferential of $f(\beta_j) = \lVert \beta_j \rVert$ at $\beta_j = \mathbf{0}$, which has the form $$ \partial f(\beta_j) = \{ \mathbf{v} \in \mathbb{R}^{p_j} \mid f(\beta_j^\prime) \geq f(\beta_j) + \mathbf{v}^\top (\beta_j^\prime - \beta_j), \forall \beta_j^\prime \in \mathbb{R}^{p_j} \} \\ = \{ \mathbf{v} \in \mathbb{R}^{p_j} \mid \lVert \beta_j^\prime \rVert \geq \mathbf{v}^\top \beta_j^\prime, \forall \beta_j^\prime \in \mathbb{R}^{p_j} \} . $$ Choosing $\beta_j^\prime = \mathbf{v}$ gives $\lVert \mathbf{v} \rVert \geq \lVert \mathbf{v} \rVert^2$, and conversely the Cauchy–Schwarz inequality shows that any $\mathbf{v}$ with $\lVert \mathbf{v} \rVert \leq 1$ satisfies the condition. Thus, a vector $\mathbf{v}$ is a subgradient of $\lVert \beta_j \rVert$ at $\beta_j = \mathbf{0}$ if and only if $\lVert \mathbf{v} \rVert \leq 1$.

The KKT conditions require $\mathbf{0} \in -(\mathbf{X}^{j})^\top (\mathbf{r}_j - \mathbf{X}^{j} \beta_j) + \lambda d_j \, \partial \lVert \beta_j \rVert$, i.e. there must exist a subgradient $\mathbf{v}$ such that $$ -(\mathbf{X}^{j})^\top (\mathbf{r}_j - \mathbf{X}^{j} \beta_j) + \lambda d_j \mathbf{v} = -(\mathbf{X}^{j})^\top \mathbf{r}_j + \lambda d_j \mathbf{v} = \mathbf{0} , \\ \Leftrightarrow \lambda d_j \mathbf{v} = \mathbf{s}_j , \\ \Leftrightarrow \mathbf{v} = \frac{1}{\lambda d_j} \mathbf{s}_j , $$ where the first step uses $\beta_j = \mathbf{0}$. Since $\lVert \mathbf{v} \rVert \leq 1$, this is possible exactly when $\lVert \frac{1}{\lambda d_j} \mathbf{s}_j \rVert \leq 1 \Leftrightarrow \lVert \mathbf{s}_j \rVert \leq \lambda d_j$. Therefore, $\beta_j$ is set to zero if $\lVert \mathbf{s}_j \rVert \leq \lambda d_j$.
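The zero case can be checked numerically in the same spirit (again my own sketch, not from the original answer): pick $\lambda$ large enough that $\lVert \mathbf{s}_j \rVert \leq \lambda d_j$ and verify that a generic minimizer of the per-group objective drives the whole group to (numerically) zero.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p_j, lam, d_j = 50, 4, 5.0, 1.0                # lambda chosen large on purpose

Xj, _ = np.linalg.qr(rng.normal(size=(n, p_j)))   # orthonormal group block
r_j = rng.normal(size=n)
s_j = Xj.T @ r_j
assert np.linalg.norm(s_j) <= lam * d_j           # Case II applies

def objective(b):
    return 0.5 * np.sum((r_j - Xj @ b) ** 2) + lam * d_j * np.linalg.norm(b)

numeric = minimize(objective, x0=np.ones(p_j), method="Nelder-Mead").x
print(np.linalg.norm(numeric))                    # should be (numerically) zero
```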

Combining both cases

It is straightforward to combine both cases into a single equation: $$ \beta_j = \left(1 - \frac{\lambda d_j}{\lVert \mathbf{s}_j \rVert} \right)_+ \mathbf{s}_j , $$ where $(\cdot)_+$ denotes the positive part of its argument.
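Putting the two cases together, a block coordinate descent loop for the group lasso could look like the following minimal sketch. This is my own illustration under the same assumptions as above (each block of columns is orthonormal); the function name and arguments are hypothetical, and a real implementation would add a convergence check instead of a fixed iteration count.

```python
import numpy as np

def group_lasso_bcd(X, Y, groups, lam, weights=None, n_iter=100):
    """Block coordinate descent for the group lasso.

    Assumes each block X[:, groups[j]] is orthonormal, i.e. (X^j)^T X^j = I.
    `groups` is a list of column-index arrays, `weights` holds the d_j (default 1).
    """
    beta = np.zeros(X.shape[1])
    if weights is None:
        weights = np.ones(len(groups))
    for _ in range(n_iter):
        for j, idx in enumerate(groups):
            Xj = X[:, idx]
            # partial residual: remove the fit of all groups except the j-th
            r_j = Y - X @ beta + Xj @ beta[idx]
            s_j = Xj.T @ r_j
            norm_s = np.linalg.norm(s_j)
            if norm_s <= lam * weights[j]:
                beta[idx] = 0.0                                       # Case II: drop the group
            else:
                beta[idx] = (1.0 - lam * weights[j] / norm_s) * s_j   # Case I: shrink the group
    return beta
```

Cycling over the groups in this way is just the combined update $\beta_j = \left(1 - \frac{\lambda d_j}{\lVert \mathbf{s}_j \rVert}\right)_+ \mathbf{s}_j$ applied one group at a time, with the partial residual recomputed after each update.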