Solved – Why does Group Lasso use L2 norm for individual group penalties

Tags: lasso, regularization

In group lasso $$\min_{\beta}\left\{\frac{1}{2} \left\lVert{y}-\sum_{l=1}^mX^{(l)}{\beta^{(l)}}\right\rVert_2^2 +\lambda\sum_{l=1}^m\sqrt{p_l}\left\lVert{\beta^{(l)}}\right\rVert_q\right\}$$ the individual group penalties use L2 norm, that is $q = 2$. What's the intuition for this choice? Would the main property of group lasso (that entire groups may be eliminated) still hold if $q$ was set to 1 or some other value?

(I assume the presence of group size weighting $\sqrt{p_l}$ makes no difference for the intuition, and for group elimination property; I included it only because it seems to be the standard formulation.)

Best Answer

Any $q > 1$ would define an estimator which performs group-wise selection and is the minimizer of a convex function. When $q=1$, the estimator reduces to a (weighted) lasso which does not perform group-wise selection. When $q < 1$, the objective function is non-convex.
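The group-elimination behaviour at $q=2$ can be seen directly from the proximal operator of the penalty, which is a block soft-thresholding: it sets an entire group to exactly zero whenever the group's norm falls below the threshold. Here is a minimal sketch using proximal gradient descent on synthetic data (the `group_lasso` helper, the toy data, and the choice $\lambda = 5$ are illustrative assumptions, not from the paper):

```python
import numpy as np

def group_soft_threshold(z, t):
    """Proximal operator of t * ||.||_2: shrinks the whole block toward
    zero, and returns exactly zero when ||z||_2 <= t."""
    norm = np.linalg.norm(z)
    if norm <= t:
        return np.zeros_like(z)
    return (1 - t / norm) * z

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent for
    (1/2)||y - X beta||_2^2 + lam * sum_l sqrt(p_l) ||beta^(l)||_2,
    where `groups` is a list of index arrays, one per group."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        z = beta - step * grad
        for g in groups:
            beta[g] = group_soft_threshold(z[g], step * lam * np.sqrt(len(g)))
    return beta

# Toy data: only the first group of predictors is truly active.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(100)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

beta_hat = group_lasso(X, y, groups, lam=5.0)
# The inactive groups are eliminated as whole blocks (exact zeros),
# while the active group survives with mildly shrunken coefficients.
```

Replacing the block update with coordinate-wise soft-thresholding (the $q=1$ case) would zero coordinates individually rather than whole groups at once.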

In the original paper,

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.

the motivation is given (in Figure 1) that the penalty looks like the lasso's penalty for coefficients in different groups while looking like the ridge regression's penalty for coefficients within the same group. This suggests that an explanation for why the group lasso uses $q=2$ reduces to an explanation for the utility of ridge regression over other $\ell_q$ penalized regression estimators. The standard intuition for this is that in ridge regression the coefficients are treated neutrally, in that they are neither encouraged to be nearly sparse (as when $q < 2$) nor to be nearly equal to each other (as when $q > 2$).
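This group-level lasso behaviour can be made precise with the standard subgradient (KKT) condition for the objective above: a whole group is estimated as exactly zero precisely when the correlation of its design block with the partial residual is small,
$$\hat\beta^{(l)} = 0 \iff \left\lVert X^{(l)\top}\Big(y - \sum_{k\neq l} X^{(k)}\hat\beta^{(k)}\Big)\right\rVert_2 \le \lambda\sqrt{p_l},$$
because the subdifferential of $\lVert\cdot\rVert_2$ at the origin is the unit $\ell_2$ ball. With $q=1$ the corresponding condition holds coordinate-wise instead, so coordinates are zeroed individually rather than as a block.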

A more practical explanation for why $q=2$ could just be that the statistical community is comfortable with $\ell_2$ penalization, and that the choice $q=2$ made it possible to derive a LARS-type algorithm for fast computation. At the time the paper was published, it was standard for new convex penalized regression estimators to be accompanied by a LARS-type algorithm.


[Fig. 1 of Yuan & Lin (2006): contours of the group lasso penalty, lasso-like across groups and ridge-like within a group; image not reproduced.]
