Feature Selection – Causes of Instability in Lasso for Feature Selection

feature selection, lasso, regression, regularization, self-study

In compressed sensing, there is a theorem guaranteeing that
$$\text{argmin} \Vert c \Vert_1\\
\text{subject to } y = Xc
$$
has a unique sparse solution $c$ (see the appendix for details).
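As a concrete illustration of this guarantee, here is a minimal Python sketch (the dimensions, sparsity level, and random Gaussian design are illustrative choices, not from any particular source) that solves the basis pursuit problem above as a linear program and recovers the planted sparse $c$ exactly:

```python
# Minimal sketch: exact sparse recovery by basis pursuit
# (min ||c||_1 subject to y = Xc), solved as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, M, sparsity = 50, 200, 5                    # illustrative sizes

X = rng.standard_normal((N, M)) / np.sqrt(N)   # Gaussian X satisfies NSP w.h.p.
c_true = np.zeros(M)
support = rng.choice(M, size=sparsity, replace=False)
c_true[support] = rng.standard_normal(sparsity)
y = X @ c_true                                 # noiseless response

# Split c = u - v with u, v >= 0, so that ||c||_1 = sum(u) + sum(v).
A_eq = np.hstack([X, -X])                      # X(u - v) = y
res = linprog(c=np.ones(2 * M), A_eq=A_eq, b_eq=y, bounds=(0, None))
c_hat = res.x[:M] - res.x[M:]

print("max recovery error:", np.max(np.abs(c_hat - c_true)))  # ~ 1e-9
```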

Is there a similar theorem for lasso? If there is such a theorem, not only will it guarantee the stability of lasso, but it will also provide lasso with a more meaningful interpretation:

lasso can uncover the sparse regression coefficient vector $c$ that was used to generate the response $y$ via $y = Xc$.

There are two reasons that I ask this question:

  1. I think 'lasso favors a sparse solution' is not an answer to why we should use lasso for feature selection, since by itself it does not tell us what advantage the selected features have.

  2. I learned that lasso is notorious for being unstable in feature selection. In practice, we have to run it on bootstrap samples to evaluate its stability (a minimal sketch of this check follows the list). What is the most crucial reason for this instability?
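Here is the bootstrap stability check referred to in point 2, as a minimal sketch; the data, penalty level, and the deliberately correlated pair of columns are all illustrative assumptions:

```python
# Minimal sketch: refit the lasso on bootstrap resamples of a fixed data set
# and count how often each feature is selected.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
N, M = 60, 100
X = rng.standard_normal((N, M))
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(N)  # two highly correlated columns
y = X[:, 0] + X[:, 2] + 0.5 * rng.standard_normal(N)

n_boot = 200
counts = np.zeros(M)
for _ in range(n_boot):
    idx = rng.integers(0, N, size=N)               # bootstrap resample
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    counts[coef != 0] += 1

freq = counts / n_boot
print("selection frequency of the correlated pair:", freq[0], freq[1])
# Typically neither frequency is close to 1: the lasso flips between the two
# correlated columns across resamples, which is the instability in question.
```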


Appendix:

Given $X_{N \times M} = (x_1, \cdots, x_M)$, let $c$ be an $\Omega$-sparse vector ($\Omega \leqslant M$), and let the response be generated by $y = Xc$. If $X$ has the NSP (null space property) of order $\Omega$ and the covariance matrix of $X$ has no eigenvalues close to zero, there will be a unique solution to
$$\text{argmin} \Vert c \Vert_1\\
\text{subject to } y = Xc
$$
which is exactly the $c$ that gives $y$.

This theorem also tells us that if $X$ does not have the NSP of order $\Omega$, it is simply hopeless to solve $\text{argmin}_{c: y = Xc} \Vert c \Vert_1$.


EDIT:

After receiving these great answers, I realized I was confused when I was asking this question.

Why this question is confusing:

I read a research paper in which we have to decide how many features (columns) the design matrix $X_{N \times M}$ will have (auxiliary features are created from primary features). Since it is a typical $n < p$ problem, $D$ is expected to be well constructed so that the lasso solution can be a good approximation of the real sparse solution.

The reasoning is based on the theorem that I mentioned in the appendix: if we aim to find an $\Omega$-sparse solution $c$, $X$ had better have the NSP of order $\Omega$.

For a general $N \times M$ matrix, if $N > C \Omega \ln M$ is violated, then

no stable and robust recovery of $c$ from $D$ and $P$ is possible

(Here $D$ corresponds to $X$ and $P$ corresponds to $y$.)

…as expected from the $N = C \Omega \ln M$ relationship, the selection of the descriptor becomes more unstable, i.e., for different training sets, the selected descriptor often differs…

The second quote is the part that confuses me. It seems to me that when the inequality is violated, it is not just that the solution may be non-unique (which is not mentioned), but also that the selected descriptors become more unstable.
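To see what that quoted claim could mean numerically, here is a rough simulation (not from the paper; all constants are ad hoc) that runs the lasso on independent training sets of growing size $N$ and counts how many distinct supports it selects:

```python
# Rough illustration: as N drops below ~ C * Omega * ln(M), the support
# selected by the lasso varies more across independent training sets.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
M, Omega = 200, 5
c_true = np.zeros(M)
c_true[:Omega] = 1.0                            # planted Omega-sparse signal

def selected_support(N):
    X = rng.standard_normal((N, M)) / np.sqrt(N)
    y = X @ c_true + 0.01 * rng.standard_normal(N)
    coef = Lasso(alpha=0.01, max_iter=50_000).fit(X, y).coef_
    return frozenset(np.flatnonzero(coef))

for N in (15, 30, 60, 120):                     # Omega * ln(M) is about 27 here
    supports = [selected_support(N) for _ in range(20)]
    print(f"N={N:4d}: {len(set(supports))} distinct supports over 20 training sets")
# Small N tends to give many distinct supports (unstable selection);
# large N tends to give a single recurring support (stable selection).
```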

Best Answer

UPDATE

See this second post for McDonald's feedback on my answer, where the notion of risk consistency is related to stability.


1) Uniqueness vs Stability

Your question is difficult to answer because it mentions two very different topics: uniqueness and stability.

  • Intuitively, a solution is unique if, given a fixed data set, the algorithm always produces the same results. Martin's answer covers this point in great detail.

  • Stability, on the other hand, can be intuitively understood as the property that predictions do not change much when the training data is modified slightly.

Stability applies to your question because Lasso feature selection is (often) performed via cross-validation: the Lasso is fit on different folds of the data and may yield different results each time.

Stability and the No Free Lunch Theorem

Using the definition from here, if we define uniform stability as:

An algorithm has uniform stability $\beta$ with respect to the loss function $V$ if the following holds:

$$\forall S \in Z^m, \ \ \forall i \in \{ 1, \dots, m\}, \ \ \sup_z \vert V(f_S, z) - V(f_{S^{\setminus i}}, z) \vert \ \leq \ \beta$$

Considered as a function of $m$, the term $\beta$ can be written as $\beta_m$. We say the algorithm is stable when $\beta_m$ decreases as $\frac{1}{m}$.

then the "No Free Lunch Theorem, Xu and Caramis (2012)" states that

If an algorithm is sparse, in the sense that it identifies redundant features, then that algorithm is not stable (and the uniform stability bound $\beta$ does not go to zero). [...] If an algorithm is stable, then there is no hope that it will be sparse. (pages 3 and 4)

For instance, $L_2$ regularized regression (ridge) is stable and does not identify redundant features, while $L_1$ regularized regression (Lasso) is unstable; the sketch below illustrates the contrast.
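A minimal sketch of this contrast (data, penalty levels, and the single-point perturbation are illustrative): refit each model after replacing one training point and compare how much the coefficients and the selected support move.

```python
# Minimal sketch: contrast the stability of L2 (ridge) and L1 (lasso)
# regularization under a single-point perturbation of the training set.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
N, M = 50, 80
X = rng.standard_normal((N, M))
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(N)   # correlated pair again
y = X[:, 0] + 0.5 * rng.standard_normal(N)

X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.standard_normal(M), rng.standard_normal()  # perturb one point

for name, model in [("ridge", Ridge(alpha=1.0)), ("lasso", Lasso(alpha=0.05))]:
    b1 = model.fit(X, y).coef_.copy()
    b2 = model.fit(X2, y2).coef_.copy()
    print(name,
          "coef change:", np.linalg.norm(b1 - b2),
          "support changes:", np.sum((b1 != 0) != (b2 != 0)))
# Ridge keeps all M coefficients nonzero but moves them little; the lasso's
# selected support can flip entirely between the two correlated columns.
```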

An attempt at answering your question

I think 'lasso favors a sparse solution' is not an answer to why we should use lasso for feature selection

  • I disagree: the reason Lasso is used for feature selection is that it yields a sparse solution and can be shown to have the IRF property, i.e. it Identifies Redundant Features.

What is the most crucial reason for this instability?

  • The No Free Lunch Theorem

Going further

This is not to say that the combination of cross-validation and Lasso doesn't work; in fact, it has been shown experimentally (and with much supporting theory) to work very well under various conditions. The main keywords here are consistency, risk, and oracle inequalities.

The following slides and paper describe some conditions under which Lasso feature selection works well: "The lasso, persistence, and cross-validation", McDonald and Homrighausen (2013). Tibshirani himself also posted a great set of notes on sparsity and linear regression.

The various conditions for consistency and their impact on the Lasso are an active topic of research and definitely not a trivial question. I can point you towards some relevant research papers.
