Solved – a mathematically rigorous proof that L1 regularization gives a sparse solution?

machine learning, regularization

It is given in the book *Machine Learning: A Probabilistic Perspective*, but I am not able to understand it. Can someone provide an explanation?

I am not clear on how the subgradient is defined and how it helps in the optimization. (Screenshots of the relevant pages from the book were attached to the original question.)

Best Answer

As shown in my answer at http://stats.stackexchange.com/questions/210040/, the lasso estimator $$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 \right\}$$ equals zero if and only if $\lambda \geq \frac{1}{n} \|X^T y\|_\infty =: \lambda_\max$. This result does, in some sense, answer your question: we now know that for a large enough tuning parameter $\lambda$, the lasso estimator $\hat\beta_\lambda$ is fully sparse. On the other hand, we also know that there exists a least squares estimator $\hat\beta_0$ that is almost surely fully dense.
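
For concreteness, here is a minimal numerical check of this claim (my addition, not from the book or the linked answer). It assumes scikit-learn, whose `Lasso` minimizes exactly the $\frac{1}{2n}$-scaled objective above when no intercept is fit:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# lambda_max = (1/n) * ||X^T y||_inf
lam_max = np.max(np.abs(X.T @ y)) / n

# Any alpha >= lambda_max gives the all-zero lasso solution; the tiny
# safety factor only guards against floating-point rounding at the boundary.
lasso = Lasso(alpha=1.001 * lam_max, fit_intercept=False).fit(X, y)
print(np.count_nonzero(lasso.coef_))   # 0: fully sparse

# The least squares fit, by contrast, is (almost surely) fully dense.
ols = LinearRegression(fit_intercept=False).fit(X, y)
print(np.count_nonzero(ols.coef_))     # 10: every coefficient nonzero
```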

More

Assumption: For convenience, let us assume that every estimator considered is unique. (If the estimators are not unique, all we've shown is that there exists at least one estimator that is sparse, which also answers the question.)

In light of this, we may be interested in knowing more about the possible sizes of the support of $\hat\beta_\lambda$ for $\lambda \in (0, \lambda_\max)$. In particular, we'll now answer whether it's true that for every $k \in \{0, 1, \dots, p\}$ there exists a $\lambda$ such that $\mathrm{card} (\mathrm{supp} (\hat\beta_\lambda)) = k$. It turns out that the answer is (almost surely) yes! The best proof I can come up with relies on explicitly constructing the lasso solution path. (If there's a better proof you know, please do comment!)
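
As an illustration (again my addition, not part of the original argument), one can trace the whole path numerically with scikit-learn's `lars_path` and read off the support size at every knot; with a continuous response, every size from $0$ to $p$ typically appears:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# method="lasso" returns the knots (alphas) and the coefficients at each knot.
alphas, _, coefs = lars_path(X, y, method="lasso")

support_sizes = [int(np.count_nonzero(coefs[:, i])) for i in range(coefs.shape[1])]
for a, s in zip(alphas, support_sizes):
    print(f"knot alpha={a:.4f}  support size={s}")
# Typically the sizes read 0, 1, 2, ..., 8: the support grows by one per knot.
```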

Since we've answered the other cases above (namely $k = 0$ and $k = p$), restrict attention to $k \in \{1, \dots, p-1\}$. Write the knots of the lasso solution path as $0 \leq \lambda_1 \leq \dots \leq \lambda_q$. Let $\lambda_{r+1}$ be the smallest knot such that $\mathrm{card}(\mathrm{supp}(\hat\beta_\lambda)) \leq k$ for all $\lambda > \lambda_{r+1}$. By definition, at least one feature leaves the support at the knot $\lambda_{r+1}$: the support size exceeds $k$ just below the knot and is at most $k$ just above it. We will now show that exactly one feature leaves the support there.

From KKT considerations (a stationarity sketch is given below), we see that, for $\lambda \in [\lambda_r, \lambda_{r+1}]$, $$\hat\beta_\lambda^j = \hat\beta_{\lambda_r}^j + (\lambda_r - \lambda)\, e_j^T \left(\tfrac{1}{n} X_S^T X_S\right)^{-1} z_S$$ for $j \in S$, and $\hat\beta_\lambda^j = 0$ for $j \not\in S$. Here $S = \mathrm{supp} (\hat\beta_{\lambda_r})$ and $z$ is the subgradient of the $\ell_1$ norm evaluated at $\hat\beta_{\lambda_r}$. Assuming that $j$ is one of the features that leaves the active set at the knot $\lambda_{r+1}$, we can set the above equation to zero and solve for $\lambda$ to find that $$\lambda_{r+1} = \lambda_r + \frac{\hat\beta_{\lambda_r}^j}{e_j^T \left(\tfrac{1}{n} X_S^T X_S\right)^{-1} z_S}.$$ (Note that if the denominator were zero, then the $j^\textrm{th}$ feature could not be leaving the active set at the knot $\lambda_{r+1}$.) At this point, we see that the proof will conclude once it is shown that the quantities $$\lambda_j^\mathrm{update} := \frac{\hat\beta_{\lambda_r}^j}{e_j^T \left(\tfrac{1}{n} X_S^T X_S\right)^{-1} z_S}$$ are distinct across $j \in S$. However, the subgradient $z_S$ lies in the finite set $\{-1,1\}^{|S|}$, and so it follows that, under a continuous distribution on the response $y$, the values $\lambda_j^\mathrm{update}$, $j \in S$, are almost surely pairwise distinct; hence exactly one feature leaves the support at $\lambda_{r+1}$. Since the support size exceeds $k$ just below $\lambda_{r+1}$, is at most $k$ just above it, and drops by exactly one at the knot, it must equal $k$ on the interval just above $\lambda_{r+1}$, which completes the argument.
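
For completeness, here is a sketch of the KKT stationarity step appealed to above, stated for the $\frac{1}{2n}$-scaled objective used throughout (this is a standard derivation, paraphrased by me rather than quoted from the book or the linked answer). A vector $\hat\beta_\lambda$ minimizes the objective if and only if $$\frac{1}{n} X^T \left( y - X \hat\beta_\lambda \right) = \lambda z, \qquad z \in \partial \|\hat\beta_\lambda\|_1, \quad \text{i.e. } z_j = \mathrm{sign}(\hat\beta_\lambda^j) \text{ if } \hat\beta_\lambda^j \neq 0, \; z_j \in [-1,1] \text{ otherwise}.$$ Restricting to the active set $S$, on which the sign vector $z_S$ is constant between consecutive knots, and solving the resulting linear system gives $$\hat\beta_{\lambda, S} = \left( \tfrac{1}{n} X_S^T X_S \right)^{-1} \left( \tfrac{1}{n} X_S^T y - \lambda z_S \right),$$ which is affine in $\lambda$; subtracting the same expression evaluated at $\lambda_r$ yields the formula used above.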
