In compressed sensing, there is a theorem guaranteeing that
$$\text{argmin}_c \Vert c \Vert_1\\
\text{subject to } y = Xc
$$
has a unique sparse solution $c$ (see the appendix for details).
Is there a similar theorem for lasso? If there is such a theorem, not only will it guarantee the stability of lasso, but it will also provide lasso with a more meaningful interpretation:
lasso can uncover the sparse regression coefficient vector $c$ that is used to generate the response $y$ by $y = Xc$.
There are two reasons that I ask this question:
- I think "lasso favors a sparse solution" is not by itself an answer to why we use lasso for feature selection, since it does not tell us what advantage the selected features actually have.
- I have learned that lasso is notorious for being unstable in feature selection. In practice, we have to run bootstrap samples to evaluate its stability. What is the most crucial reason for this instability?
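To make the bootstrap-stability point concrete, here is a minimal sketch (my own illustration, not from the post; the dimensions, sparsity level, noise scale, and `alpha` are arbitrary choices): refit the lasso on bootstrap resamples and compare the selected supports.

```python
# Sketch: lasso's selected feature set often changes across bootstrap
# resamples of the same data set, which is what "instability" means here.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, M = 50, 200                        # N < M, the typical regime
X = rng.standard_normal((N, M))
c_true = np.zeros(M)
c_true[:5] = 3.0                      # 5-sparse ground truth (arbitrary)
y = X @ c_true + 0.1 * rng.standard_normal(N)

supports = []
for _ in range(20):
    idx = rng.integers(0, N, size=N)  # bootstrap resample of rows
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    supports.append(frozenset(np.flatnonzero(coef)))

# A stable selector would return the same support every time.
print("distinct supports across 20 resamples:", len(set(supports)))
```

If the selection were stable, the printed count would be 1; in the $N < M$ regime it is typically larger.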
Appendix:
Given $X_{N \times M} = (x_1, \cdots, x_M)$ and an $\Omega$-sparse vector $c$ ($\Omega \leqslant M$), the process $y = Xc$ generates the response $y$. If $X$ has the NSP (null space property) of order $\Omega$ and the covariance matrix of $X$ has no eigenvalue close to zero, there will be a unique solution to
$$\text{argmin}_c \Vert c \Vert_1\\
\text{subject to } y = Xc
$$
which is exactly the $c$ that gives $y$.
This theorem also tells us that if $X$ does not have the NSP of order $\Omega$, it is simply hopeless to solve $\text{argmin}_{c: y = Xc} \Vert c \Vert_1$.
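As a concrete check of the recovery claim, here is a sketch (my own construction, not from the appendix; the dimensions and sparsity level are arbitrary) that solves the noiseless $\ell_1$ problem as a linear program via the standard basis-pursuit reformulation $c = u - v$ with $u, v \geqslant 0$:

```python
# Basis pursuit: min ||c||_1 s.t. y = Xc, rewritten as an LP.
# With a random Gaussian X and N comfortably above Omega*ln(M),
# the L1 minimizer typically recovers the true sparse c exactly.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
N, M, Omega = 40, 100, 4
X = rng.standard_normal((N, M))
c_true = np.zeros(M)
c_true[rng.choice(M, Omega, replace=False)] = rng.standard_normal(Omega)
y = X @ c_true

# Variables [u; v], u, v >= 0; objective 1'u + 1'v = ||c||_1;
# equality constraint [X, -X][u; v] = y.
res = linprog(c=np.ones(2 * M),
              A_eq=np.hstack([X, -X]), b_eq=y,
              bounds=[(0, None)] * (2 * M), method="highs")
c_hat = res.x[:M] - res.x[M:]
print("max recovery error:", np.max(np.abs(c_hat - c_true)))
```

With these (assumed) dimensions the Gaussian $X$ satisfies the recovery conditions with high probability, so the error should be near machine precision; shrink $N$ toward $\Omega$ and recovery starts to fail.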
EDIT:
After receiving these great answers, I realized I was confused when I was asking this question.
Why this question is confusing:
I read a research paper in which we have to decide how many features (columns) the design matrix $X_{N \times M}$ should have (auxiliary features are created from primary features). Since it is a typical $n < p$ problem (here $N < M$), $D$ is expected to be well constructed so that the solution of the lasso can be a good approximation of the true sparse solution.
The reasoning comes from the theorem I mentioned in the appendix: if we aim to find an $\Omega$-sparse solution $c$, then $X$ had better have the NSP of order $\Omega$.
For a general $N \times M$ matrix, if $N > C \Omega \ln M$ is violated, then

> no stable and robust recovery of $c$ from $D$ and $P$ is possible

(here $D$ corresponds to $X$ and $P$ corresponds to $y$), and

> …as expected from the $N = C \Omega \ln M$ relationship, the selection of the descriptor becomes more unstable, i.e., for different training sets, the selected descriptor often differs…
The second quote is the part that confuses me. It seems that when the inequality is violated, it is not just that the solution may be non-unique (which is not mentioned), but also that the selected descriptor becomes more unstable.
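The effect described in the second quote can be sketched empirically (my own toy setup, not from the paper; all sizes, the noise scale, and `alpha` are arbitrary): draw independent training sets at several sample sizes $N$ and see how often the lasso recovers the same (true) support.

```python
# As N grows past roughly Omega*ln(M), lasso's selected support
# stabilizes across independent training sets; below it, selection varies.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
M, Omega = 200, 5
c_true = np.zeros(M)
c_true[:Omega] = 2.0
true_support = frozenset(range(Omega))

rates = []
for N in (15, 40, 120):
    hits = 0
    for _ in range(30):                          # 30 independent training sets
        X = rng.standard_normal((N, M))
        y = X @ c_true + 0.05 * rng.standard_normal(N)
        coef = Lasso(alpha=0.1).fit(X, y).coef_
        hits += frozenset(np.flatnonzero(coef)) == true_support
    rates.append(hits / 30)
    print(f"N={N:4d}: exact-support recovery rate {rates[-1]:.2f}")
```

Under these assumptions the recovery rate is low for small $N$ and rises as $N$ passes the $C \Omega \ln M$ threshold, which matches the "descriptor becomes more unstable" remark.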
Best Answer
UPDATE
See this second post for McDonald's feedback on my answer where the notion of risk consistency is related to stability.
1) Uniqueness vs Stability
Your question is difficult to answer because it mentions two very different topics: uniqueness and stability.
Intuitively, a solution is unique if, given a fixed data set, the algorithm always produces the same result. Martin's answer covers this point in great detail.
Stability, on the other hand, can be understood intuitively as the property that the prediction does not change much when the training data is modified slightly.
Stability applies to your question because Lasso feature selection is (often) performed via cross-validation: the Lasso algorithm is run on different folds of the data and may yield different results each time.
Stability and the No Free Lunch Theorem
Using the definition from here, if we define uniform stability as:

then the No Free Lunch Theorem of Xu and Caramanis (2012) states that
For instance, $L_2$-regularized regression (ridge) is stable but does not identify redundant features, while $L_1$-regularized regression (Lasso) is unstable.
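This stability/sparsity trade-off can be seen in a toy experiment (my own sketch, not from the cited theorem; dimensions and penalties are arbitrary): perturb one training point slightly and compare what ridge and lasso select.

```python
# Ridge keeps every coefficient nonzero (no selection, but stable);
# lasso produces a sparse support, which can shift under a tiny
# perturbation of the training data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
N, M = 30, 60
X = rng.standard_normal((N, M))
y = 2 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.standard_normal(N)

X2, y2 = X.copy(), y.copy()
y2[0] += 0.5                          # slightly modify a single response

sizes = {}
for name, model in [("ridge", Ridge(alpha=1.0)), ("lasso", Lasso(alpha=0.1))]:
    s1 = set(np.flatnonzero(model.fit(X, y).coef_))
    s2 = set(np.flatnonzero(model.fit(X2, y2).coef_))
    sizes[name] = (len(s1), len(s2))
    print(f"{name}: support sizes {len(s1)} -> {len(s2)}, changed: {s1 != s2}")
```

Ridge's "support" is always all $M$ features (it never zeroes anything out), so it is trivially stable in selection; lasso's small support is the useful part, and also the fragile part.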
An attempt at answering your question
Going further
This is not to say that the combination of cross-validation and Lasso doesn't work; in fact, it has been shown experimentally (and with much supporting theory) to work very well under various conditions. The main keywords here are consistency, risk, and oracle inequalities.
The following slides and paper by McDonald and Homrighausen (2013), "The lasso, persistence, and cross-validation", describe some conditions under which Lasso feature selection works well. Tibshirani himself also posted a great set of notes on sparsity and linear regression.
The various conditions for consistency, and their impact on the Lasso, are an active topic of research and definitely not a trivial question. I can point you towards some relevant research papers: