Solved – Soft-thresholding vs. Lasso penalization

feature selection, genetics, lasso, multivariate analysis

I am trying to summarize what I have understood so far about penalized multivariate analysis with high-dimensional data sets, and I still struggle to get a proper definition of soft-thresholding vs. Lasso (or $L_1$) penalization.

More precisely, I used sparse PLS regression to analyze a two-block data structure comprising genomic data (single nucleotide polymorphisms, coded as the number of copies of the minor allele, {0, 1, 2}, and treated as a numerical variable) and continuous phenotypes (scores quantifying personality traits or cerebral asymmetry, also treated as continuous variables). The idea was to isolate the most influential predictors (here, the genetic variations on the DNA sequence) that explain inter-individual phenotypic variation.

I initially used the mixOmics R package (formerly integrOmics), which features penalized PLS regression and regularized CCA. Looking at the R code, we found that the "sparsity" in the predictors is simply induced by keeping, on each component, only the variables with the highest loadings (in absolute value); the algorithm is iterative and computes variable loadings on $k$ components, deflating the predictors block at each iteration (see Sparse PLS: Variable Selection when Integrating Omics Data for an overview).
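For concreteness, here is a minimal R sketch of that selection rule as I understand it (illustrative names only, not the actual mixOmics code): the loadings that survive are left unshrunk, and everything else is set to zero.

```r
## Hard truncation: keep the k loadings with the largest absolute value,
## unshrunk, and zero out the rest (a sketch of the rule described above,
## not the mixOmics implementation).
hard_select <- function(loadings, k) {
  keep <- order(abs(loadings), decreasing = TRUE)[seq_len(k)]
  out  <- numeric(length(loadings))
  out[keep] <- loadings[keep]
  out
}

hard_select(c(0.9, -0.05, 0.4, -0.6), k = 2)
#> [1]  0.9  0.0  0.0 -0.6
```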
In contrast, the spls package co-authored by S. Keleş (see Sparse Partial Least Squares Regression for Simultaneous Dimension Reduction and Variable Selection for a more formal description of the approach taken by these authors) implements $L_1$-penalization for variable selection.

It is not obvious to me whether there is a strict "bijection", so to speak, between iterative feature selection based on soft-thresholding and $L_1$ regularization. So my question is: is there any mathematical connection between the two?
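For reference, by soft-thresholding I mean the operator that arises as the closed-form solution of the univariate $L_1$-penalized problem $\min_v \tfrac{1}{2}(z - v)^2 + \lambda|v|$. Here is a minimal R sketch on a toy loading vector (illustrative names only, not the spls code; `lambda` is a hypothetical tuning parameter):

```r
## Soft-thresholding: the closed-form minimizer of
##   0.5 * (z - v)^2 + lambda * |v|
## applied elementwise to a vector of loadings.
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

z <- c(0.9, -0.05, 0.4, -0.6)
soft_threshold(z, lambda = 0.5)
#> [1]  0.4  0.0  0.0 -0.1   # same support as keeping the top 2 here, but shrunk
```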

References

  1. Chun, H. and Keleş, S. (2010), Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B, 72, 3–25.
  2. Lê Cao, K.-A., Rossouw, D., Robert-Granié, C., and Besse, P. (2008), A Sparse PLS for Variable Selection when Integrating Omics Data. Statistical Applications in Genetics and Molecular Biology, 7, Article 35.

Best Answer

What I'll say holds for regression, but it should be true for PLS as well. It is not a bijection: depending on how strongly you enforce the constraint in the $L_1$ formulation, you get a whole range of "answers", whereas the truncation approach admits only $p$ possible answers (where $p$ is the number of variables). In other words, there are more solutions in the $L_1$ formulation than in the "truncation" formulation.
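To make the counting argument concrete, here is a small, self-contained R sketch on a toy loading vector (illustrative only): truncation can return at most $p$ distinct coefficient patterns, whereas the soft-thresholded ($L_1$-type) solution varies continuously with the penalty $\lambda$, shrinking the surviving coefficients as well as selecting them.

```r
## Toy loading vector with p = 4 variables.
z <- c(0.9, -0.05, 0.4, -0.6)

## Truncation: only p distinct answers, one per choice of k.
truncate_top_k <- function(z, k) {
  keep <- order(abs(z), decreasing = TRUE)[seq_len(k)]
  out  <- numeric(length(z))
  out[keep] <- z[keep]
  out
}
sapply(1:4, function(k) truncate_top_k(z, k))

## Soft-thresholding (L1-type): a continuum of answers as lambda varies,
## with the retained coefficients shrunk towards zero.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
sapply(seq(0, 0.8, by = 0.2), function(l) soft_threshold(z, l))
```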
