Solved – Why does Group Lasso use L2 norm for individual group penalties

Tags: lasso, regularization

In group lasso $$\min_{\beta}\left\{\frac{1}{2} \left\lVert{y}-\sum_{l=1}^mX^{(l)}{\beta^{(l)}}\right\rVert_2^2 +\lambda\sum_{l=1}^m\sqrt{p_l}\left\lVert{\beta^{(l)}}\right\rVert_q\right\}$$ the individual group penalties use L2 norm, that is $q = 2$. What's the intuition for this choice? Would the main property of group lasso (that entire groups may be eliminated) still hold if $q$ was set to 1 or some other value?

(I assume the presence of group size weighting $\sqrt{p_l}$ makes no difference for the intuition, and for group elimination property; I included it only because it seems to be the standard formulation.)

Best Answer

Any $q > 1$ would define an estimator which performs group-wise selection and is the minimizer of a convex function. When $q=1$, the estimator reduces to a (weighted) lasso which does not perform group-wise selection. When $q < 1$, the objective function is non-convex.
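The group-elimination behaviour at $q=2$ can be seen directly from the proximal operator of the penalty, which is a block soft-thresholding: it sets an entire group to exactly zero whenever the group's norm falls below the threshold. Here is a minimal sketch using proximal gradient descent on synthetic data (the `group_lasso` helper, the toy data, and the choice $\lambda = 5$ are illustrative assumptions, not from the paper):

```python
import numpy as np

def group_soft_threshold(z, t):
    """Proximal operator of t * ||.||_2: shrinks the whole block toward
    zero, and returns exactly zero when ||z||_2 <= t."""
    norm = np.linalg.norm(z)
    if norm <= t:
        return np.zeros_like(z)
    return (1 - t / norm) * z

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent for
    (1/2)||y - X beta||_2^2 + lam * sum_l sqrt(p_l) ||beta^(l)||_2,
    where `groups` is a list of index arrays, one per group."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        z = beta - step * grad
        for g in groups:
            beta[g] = group_soft_threshold(z[g], step * lam * np.sqrt(len(g)))
    return beta

# Toy data: only the first group of predictors is truly active.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(100)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

beta_hat = group_lasso(X, y, groups, lam=5.0)
# The inactive groups are eliminated as whole blocks (exact zeros),
# while the active group survives with mildly shrunken coefficients.
```

Replacing the block update with coordinate-wise soft-thresholding (the $q=1$ case) would zero coordinates individually rather than whole groups at once.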

In the original paper,

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.

the motivation is given (in Figure 1) that the penalty looks like the lasso's penalty for coefficients in different groups while looking like the ridge regression's penalty for coefficients within the same group. This suggests that an explanation for why the group lasso uses $q=2$ reduces to an explanation for the utility of ridge regression over other $\ell_q$ penalized regression estimators. The standard intuition for this is that in ridge regression the coefficients are treated neutrally, in that they are neither encouraged to be nearly sparse (as when $q < 2$) nor to be nearly equal to each other (as when $q > 2$).
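This group-level lasso behaviour can be made precise with the standard subgradient (KKT) condition for the objective above: a whole group is estimated as exactly zero precisely when the correlation of its design block with the partial residual is small,
$$\hat\beta^{(l)} = 0 \iff \left\lVert X^{(l)\top}\Big(y - \sum_{k\neq l} X^{(k)}\hat\beta^{(k)}\Big)\right\rVert_2 \le \lambda\sqrt{p_l},$$
because the subdifferential of $\lVert\cdot\rVert_2$ at the origin is the unit $\ell_2$ ball. With $q=1$ the corresponding condition holds coordinate-wise instead, so coordinates are zeroed individually rather than as a block.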

A more practical explanation for why $q=2$ could just be that the statistical community is comfortable with $\ell_2$ penalization, and that the choice $q=2$ made it possible to derive a LARS-type algorithm for fast computation. At the time the paper was published, it was standard for new convex penalized regression estimators to be accompanied by a LARS-type algorithm.


[Fig. 1 of Yuan & Lin (2006): contours of the group lasso penalty, lasso-like across groups and ridge-like within a group; image not reproduced.]
