Let's consider a very simple model: $y = \beta x + e$, with an L1 penalty on $\hat{\beta}$ and a least-squares loss function on $\hat{e}$. We can expand the expression to be minimized as:
$\min_{\hat{\beta}} \; y^Ty - 2 y^Tx\hat{\beta} + \hat{\beta}^2 x^Tx + 2\lambda|\hat{\beta}|$
Keep in mind this is a univariate example, with $\beta$ and $x$ being scalars, to show how LASSO can send a coefficient to zero. This can be generalized to the multivariate case.
Let us assume the least-squares solution is some $\hat{\beta} > 0$, which is equivalent to assuming that $y^Tx > 0$, and see what happens when we add the L1 penalty. With $\hat{\beta}>0$, $|\hat{\beta}| = \hat{\beta}$, so the penalty term equals $2\lambda\hat{\beta}$. The derivative of the objective function w.r.t. $\hat{\beta}$ is:
$-2y^Tx +2x^Tx\hat{\beta} + 2\lambda$
which evidently has solution $\hat{\beta} = (y^Tx - \lambda)/(x^Tx)$.
Obviously by increasing $\lambda$ we can drive $\hat{\beta}$ to zero (at $\lambda = y^Tx$). However, once $\hat{\beta} = 0$, increasing $\lambda$ won't drive it negative, because, writing loosely, the instant $\hat{\beta}$ becomes negative, the derivative of the objective function changes to:
$-2y^Tx +2x^Tx\hat{\beta} - 2\lambda$
where the flip in the sign of $\lambda$ is due to the absolute value nature of the penalty term; when $\hat{\beta}$ becomes negative, the penalty term becomes $-2\lambda\hat{\beta}$, and taking the derivative w.r.t. $\hat{\beta}$ results in $-2\lambda$. This leads to the solution $\hat{\beta} = (y^Tx + \lambda)/(x^Tx)$, which is obviously inconsistent with $\hat{\beta} < 0$ (given that the least-squares solution is $> 0$, which implies $y^Tx > 0$, and that $\lambda > 0$). Moving $\hat{\beta}$ from $0$ to some value $< 0$ would increase both the L1 penalty and the squared-error term (since we move farther from the least-squares solution), so we don't; we just stick at $\hat{\beta}=0$.
It should be intuitively clear that the same logic applies, with appropriate sign changes, if the least-squares solution satisfies $\hat{\beta} < 0$.
With the squared (L2, i.e. ridge) penalty $\lambda\hat{\beta}^2$, however, the derivative becomes:
$-2y^Tx +2x^Tx\hat{\beta} + 2\lambda\hat{\beta}$
which evidently has solution $\hat{\beta} = y^Tx/(x^Tx + \lambda)$. Obviously no increase in $\lambda$ will drive this all the way to zero. So the L2 penalty can't act as a variable selection tool without some mild ad-hockery such as "set the parameter estimate equal to zero if it is less than $\epsilon$".
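As a quick numerical check of both closed forms, here is a minimal sketch (the data are simulated purely for illustration; the variable names are mine, not from any particular source). It confirms that the lasso estimate is exactly zero once $\lambda \ge y^Tx$, while the ridge estimate only shrinks toward zero, and it cross-checks the lasso closed form against a brute-force grid minimization of the expanded objective above.

```python
import numpy as np

# Toy univariate data (made up for illustration): true beta = 0.5 > 0
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

yx, xx = y @ x, x @ x

def lasso_beta(lam):
    # closed form derived above: soft-thresholding of y'x
    return np.sign(yx) * max(abs(yx) - lam, 0.0) / xx

def ridge_beta(lam):
    # closed form derived above: y'x / (x'x + lambda)
    return yx / (xx + lam)

def brute_force_lasso(lam):
    # sanity check: grid-minimize the expanded objective y'y - 2 y'x b + b^2 x'x + 2 lam |b|
    grid = np.linspace(-1.0, 1.0, 200001)
    obj = y @ y - 2 * yx * grid + xx * grid**2 + 2 * lam * np.abs(grid)
    return grid[np.argmin(obj)]

for lam in [0.0, 10.0, yx, 2 * yx]:
    print(f"lambda={lam:7.1f}   lasso={lasso_beta(lam):+.4f} "
          f"(grid check {brute_force_lasso(lam):+.4f})   ridge={ridge_beta(lam):+.6f}")
# The lasso estimate hits exactly 0 once lambda >= y'x; the ridge estimate never does.
```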
Obviously things can change when you move to multivariate models; for example, moving one parameter estimate around might force another one to change sign. But the general principle is the same: the L2 penalty function can't get you all the way to zero because, writing very heuristically, it in effect enters the "denominator" of the expression for $\hat{\beta}$, whereas the L1 penalty function can, because it in effect enters the "numerator".
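To see the same contrast in a multivariate setting, here is a small scikit-learn sketch (again on simulated data, with hypothetical sizes and penalty strengths): the lasso sets most of the noise coefficients to exactly zero, while ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]      # only 5 truly non-zero coefficients
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("exact zeros, lasso:", np.sum(lasso.coef_ == 0.0))   # typically most of the 45 noise columns
print("exact zeros, ridge:", np.sum(ridge.coef_ == 0.0))   # typically none
```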
I don't believe there is anything wrong with using LASSO for variable selection and then using OLS. From "Elements of Statistical Learning" (p. 91):
...the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero and in general they are not consistent [Added note: this means that, as the sample size grows, the coefficient estimates do not converge to the true coefficient values]. One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an un-restricted linear model to the selected set of features. This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors. Since the variables in the second step have less "competition" from noise variables, cross-validation will tend to pick a smaller value for $\lambda$ [the penalty parameter], and hence their coefficients will be shrunken less than those in the initial estimate.
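As a concrete illustration of the "lasso for selection, then un-penalized refit" recipe quoted above, here is a minimal scikit-learn sketch (simulated data; `LassoCV` picks the penalty by cross-validation, and the refit is plain OLS on the selected columns):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]
y = X @ beta_true + rng.normal(size=n)

# Step 1: lasso with cross-validated penalty, used only to pick the support
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0.0)

# Step 2: un-penalized OLS refit on the selected columns (removes the shrinkage bias)
ols = LinearRegression().fit(X[:, selected], y)

print("selected columns:", selected)
print("lasso coefficients (selected):", np.round(lasso.coef_[selected], 2))
print("OLS refit coefficients:       ", np.round(ols.coef_, 2))
```

The refit coefficients are typically noticeably larger in magnitude than the lasso's shrunken ones, which is exactly the bias the quote describes.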
Another reasonable approach, similar in spirit to the relaxed lasso, would be to use the lasso once (or several times in tandem) to identify a group of candidate predictor variables, and then use best subsets regression to choose among those candidates (also see "Elements of Statistical Learning" for this). For this to work, you would need to refine the group of candidate predictors down to around 35, which won't always be feasible. You can use cross-validation or AIC as a criterion to prevent over-fitting.
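A sketch of that two-stage idea, under assumptions of my own choosing (simulated data, a cap of 12 candidates so that naive enumeration stays cheap, and the Gaussian AIC up to an additive constant as the score):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ beta_true + rng.normal(size=n)

# Stage 1: lasso narrows the 50 predictors down to a small candidate set.
# (Capped at 12 here so naive enumeration stays cheap; a branch-and-bound
#  best-subsets routine could handle the ~35 mentioned above.)
coef = LassoCV(cv=5).fit(X, y).coef_
candidates = np.argsort(-np.abs(coef))[:12]
candidates = candidates[coef[candidates] != 0.0]

def aic(cols):
    # OLS fit on the chosen columns; Gaussian AIC up to an additive constant
    Z = np.column_stack([np.ones(n), X[:, list(cols)]])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return n * np.log(resid @ resid / n) + 2 * (len(cols) + 1)

# Stage 2: exhaustive search over all subsets of the candidate set, scored by AIC
best = min((s for k in range(1, len(candidates) + 1)
            for s in combinations(candidates, k)), key=aic)
print("lasso candidates:  ", sorted(candidates))
print("best subset by AIC:", sorted(best))
```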
Intuitively speaking, the group lasso can be preferred to the lasso since it provides a means for us to incorporate (a certain type of) additional information into our estimate for the true coefficient $\beta^*$. As an extreme scenario, consider the following:
With $y \sim \mathcal{N} (X \beta^*, \sigma^2 I )$, put $S = \{j : \beta^*_j \neq 0 \}$ as the support of $\beta^*$. Consider the "oracle" estimator $$\hat{\beta} = \arg\min_{\beta} \|y - X \beta\|_2^2 + \lambda \left( |S|^{1/2} \|\beta_S\|_2 + (p-|S|)^{1/2} \|\beta_{S^C}\|_2 \right),$$ which is the group lasso with two groups: one the true support and one its complement. Let $\lambda_{max}$ be the smallest value of $\lambda$ that makes $\hat{\beta} = 0$. Due to the nature of the group lasso penalty, we know that as $\lambda$ moves from $\lambda_{max}$ to $\lambda_{max} - \epsilon$ (for some small $\epsilon > 0$), exactly one group will enter the support of $\hat{\beta}$, which is popularly considered as an estimate for $S$. Due to our grouping, with high probability, the selected group will be $S$, and we'll have done a perfect job.
In practice, we don't select the groups this well. However, the groups, despite being finer than in the extreme scenario above, will still help us: the choice would still be made between groups containing true covariates and groups containing only noise covariates. We're still borrowing strength.
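One way to make the "whole groups enter or leave together" behaviour concrete: when the design is orthonormal ($X^TX = I$), the group lasso has a closed-form solution, block-wise soft thresholding of the least-squares estimate. The sketch below is a minimal numpy illustration under that assumption, with two hypothetical groups and made-up numbers; `z` plays the role of $X^Ty$.

```python
import numpy as np

def block_soft_threshold(z, groups, lam):
    """Group lasso solution for an orthonormal design X (X'X = I), where z = X'y.
    Minimizes 0.5*||y - X b||^2 + lam * sum_g sqrt(|g|) * ||b_g||_2."""
    beta = np.zeros_like(z)
    for g in groups:
        norm = np.linalg.norm(z[g])
        if norm > 0:
            beta[g] = max(0.0, 1.0 - lam * np.sqrt(len(g)) / norm) * z[g]
    return beta

# Two hypothetical groups: group 0 carries signal, group 1 is pure noise.
rng = np.random.default_rng(4)
groups = [np.arange(0, 3), np.arange(3, 6)]
z = np.concatenate([np.array([2.0, -1.5, 1.0]), 0.1 * rng.normal(size=3)])

for lam in [0.0, 0.2, 1.0, 2.0]:
    beta = block_soft_threshold(z, groups, lam)
    print(f"lam={lam:3.1f}  group0={np.round(beta[groups[0]], 2)}  "
          f"group1={np.round(beta[groups[1]], 2)}")
```

Note how all the coefficients in a group vanish at the same value of $\lambda$, rather than one at a time as in the plain lasso.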
This is formalized here. They show, under some conditions, that an upper bound on the prediction error of the group lasso is lower than a lower bound on the prediction error of the plain lasso. That is, they prove that the grouping makes our estimation do better.
For your second question: the (plain) lasso penalty is piecewise linear, and this gives rise to the piecewise linear solution path. Intuitively, in the group lasso case, the penalty is no longer piecewise linear, so we no longer have this property. A great reference on piecewise linearity of solution paths is here. See their Proposition 1. Let $L(\beta) = \|y - X \beta\|_2^2$ and $J(\beta) = \sum_{g \in G} |g|^{1/2} \|\beta_g\|_2$. They show that the solution path of the group lasso is piecewise linear if and only if $$\left( \nabla^2L(\hat{\beta}) + \lambda \nabla^2 J(\hat{\beta}) \right)^{-1} \nabla J(\hat{\beta})$$ is piecewise constant. Of course, it isn't piecewise constant here, since our penalty $J$ has global curvature.
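To see the piecewise-linear lasso path in code, here is a small sketch using scikit-learn's `lars_path` with `method='lasso'`, which computes the exact path: a finite set of knots between which every coefficient is an affine function of the penalty. (The data are simulated; this only illustrates the plain-lasso side of the contrast, since there is no comparable exact path routine for the group lasso, which is the point above.)

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(5)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0, 0, 0, 0, 0, 0]) + rng.normal(size=n)

# LARS-lasso returns the whole solution path exactly: a finite set of knots
# (values of alpha) between which each coefficient changes linearly.
alphas, active, coefs = lars_path(X, y, method='lasso')

print("knots along the path (alphas):", np.round(alphas, 3))
print("active variables at the end of the path:", active)
print("coefficients at the knots (one column per knot):")
print(np.round(coefs, 2))
```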