Sparse group lasso derivation of soft thresholding operator via subgradient equations

convex optimization, linear regression, subgradient

In the sparse group lasso (SGL) paper, the authors derive a condition for $\beta^{(k)}=0$ from the subgradient equation:
$$ \frac{1}{n} {X^{(k)}}^T\Big(y-\sum_{l=1}^m X^{(l)} \beta^{(l)}\Big) = (1-\alpha)\lambda u + \alpha \lambda v$$
where $m$ is the number of groups. As in the linear regression model, $X$ is the $n \times p$ design matrix, $X^{(l)}$ is the $n \times p_l$ submatrix of $X$ containing the covariates of group $l$ (with $p_l$ the number of covariates in group $l$), $y$ is the response vector of length $n$, and $\beta$ is the coefficient vector to be estimated. The mixing parameter $\alpha \in [0,1]$ and the tuning parameter $\lambda \geq 0$ are fixed. Here $u$ and $v$ are subgradients
of $||\beta^{(k)}||_2$ and $||\beta^{(k)}||_1$ respectively, evaluated at $\beta^{(k)}$:
$$ u \in \{u : ||u||_2 \leq 1\} \quad \text{if } \beta^{(k)}=0,$$
$$ v_j \in \{v_j : |v_j| \leq 1\} \quad \text{if } \beta_j^{(k)}=0.$$
They state that, with a little bit of algebra, the subgradient equations are satisfied with $\beta^{(k)}=0$ if:
$$ ||S({X^{(k)}}^T r_{(-k)}/n, \alpha \lambda)||_2 \leq (1-\alpha)\lambda$$
where $r_{(-k)}$ is the partial residual of $y$, obtained by subtracting all group fits other than that of group $k$:
$$r_{(-k)}=y-\sum_{l \neq k} X^{(l)} \beta^{(l)}$$
and $S(\cdot)$ is the coordinate-wise soft-thresholding operator:
$$(S(z, \alpha \lambda))_j = \operatorname{sign}(z_j)(|z_j| - \alpha \lambda)_+ .$$
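For concreteness, here is a small Python sketch of this operator and the resulting check for whether group $k$ can be set to zero; the names `Xk`, `r_minus_k`, `alpha`, `lam`, and `n` are just stand-ins for the symbols above, not code from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Coordinate-wise soft-thresholding: (S(z, t))_j = sign(z_j) * (|z_j| - t)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def group_is_zero(Xk, r_minus_k, n, alpha, lam):
    """Check the condition || S(Xk^T r_(-k) / n, alpha*lam) ||_2 <= (1 - alpha) * lam."""
    z = Xk.T @ r_minus_k / n
    return np.linalg.norm(soft_threshold(z, alpha * lam)) <= (1 - alpha) * lam
```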

I tried to derive the same inequality, but I ended up with these:
$$||\frac{1}{n} {X^{(k)}}^T r_{(-k)} - \alpha \lambda [-1,1]||_2 \leq (1-\alpha)\lambda $$

$$||\, |\frac{1}{n} {X^{(k)}}^T r_{(-k)}| - \alpha \lambda \,||_2 \leq (1-\alpha)\lambda $$
But I couldn't figure out how to arrive at the soft-thresholding operator from here.

I would really appreciate help with this. Thanks very much.

Best Answer

You already found $$||\frac{1}{n} {X^{(k)}}^T r_{(-k)} - \alpha \lambda [-1,1]||_2 \leq (1-\alpha)\lambda, \qquad (1) $$ where you can choose any value in $[-1,1]$ to satisfy the inequality, separately for each element of the vector. Let us see how you would select this value for the $j^{th}$ component. The goal is to satisfy inequality (1), so you want the component to be as close to $0$ as possible. The value you select depends on $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j. \quad (2)$$ You select $1$ (from the interval $[-1,1]$) when (2) is larger than $\alpha\lambda$, $-1$ when (2) is smaller than $-\alpha\lambda$, and (2) divided by $\alpha\lambda$ when (2) is in $[-\alpha\lambda,\alpha\lambda]$. The last displayed formula in your question does not take the last case into account, while the $(\cdot)_+$ operator does.
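As an illustrative sketch of this selection rule (the helper `best_subgradient` is hypothetical, with `z_j` standing for expression (2)):

```python
def best_subgradient(z_j, alpha, lam):
    """Choose v_j in [-1, 1] so that z_j - alpha*lam*v_j is as close to 0 as possible."""
    t = alpha * lam  # assumes t > 0
    if z_j > t:
        return 1.0        # brings the component down to z_j - t
    elif z_j < -t:
        return -1.0       # brings the component up to z_j + t
    else:
        return z_j / t    # makes the component exactly 0
```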

When (2) is larger than $\alpha\lambda$, the $j^{th}$ component equals: $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j-\alpha\lambda.$$ When (2) is smaller than $-\alpha\lambda$, the $j^{th}$ component equals: $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j+\alpha\lambda.$$ When (2) is in $[-\alpha\lambda,\alpha\lambda]$, the $j^{th}$ component equals $0$. In each of these cases, it is easy to see that the component equals $$\text{sign}\left(\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j\right)\left(\left|\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j\right|- \alpha \lambda\right)_+.$$ Note that the sign operator can be omitted, since it does not affect the value of the 2-norm in (1).
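A quick numerical check of this equivalence (illustrative values only; `z` stands in for $\frac{1}{n}{X^{(k)}}^T r_{(-k)}$): the coordinate-wise optimal choice of $v$ leaves exactly the soft-thresholded vector, and no other choice of $v$ in $[-1,1]^{p_k}$ gives a smaller 2-norm.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, lam = 0.6, 1.5
t = alpha * lam
z = rng.normal(scale=2.0, size=8)   # stands in for (1/n) X^(k)T r_(-k)

# optimal subgradient choice, coordinate-wise: 1, -1, or z_j / (alpha*lam)
v = np.clip(z / t, -1.0, 1.0)

# the residual z - alpha*lam*v equals the soft-threshold of z
resid = z - t * v
soft = np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
assert np.allclose(resid, soft)

# and no other subgradient v' in [-1, 1]^p gives a smaller 2-norm
for _ in range(1000):
    v_rand = rng.uniform(-1.0, 1.0, size=z.shape)
    assert np.linalg.norm(z - t * v_rand) >= np.linalg.norm(resid) - 1e-12
```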