Solved – Why is “relaxed lasso” different from standard lasso

Tags: lasso, optimization, regression, regularization

If we start with a data set $(X,Y)$, apply the Lasso to it, and obtain a solution $\beta^L$, we can apply the Lasso again to the data set $(X_S, Y)$, where $S$ is the set of non-zero indices of $\beta^L$, to obtain a solution $\beta^{RL}$, called the 'relaxed Lasso' solution (correct me if I'm wrong!). The solution $\beta^L$ must satisfy the Karush–Kuhn–Tucker (KKT) conditions for $(X,Y)$, but, given the form of the KKT conditions for $(X_S, Y)$, doesn't it satisfy those as well? If so, what is the point of running the Lasso a second time?
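
To make the KKT reasoning concrete, the stationarity condition can be spelled out; the $\tfrac{1}{2n}$ scaling of the squared-error term is a common convention assumed here, not taken from the question:

$$\frac{1}{n} X^\top (Y - X\beta^L) = \lambda s, \qquad s_j = \operatorname{sign}(\beta^L_j) \ \text{if}\ \beta^L_j \neq 0, \quad |s_j| \le 1 \ \text{otherwise}.$$

Keeping only the rows indexed by $S$ gives $\frac{1}{n} X_S^\top (Y - X_S \beta^L_S) = \lambda s_S$, which is exactly the stationarity condition of the Lasso on $(X_S, Y)$ at the same penalty $\lambda$.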

This question is a follow-up to: Advantages of doing "double lasso" or performing lasso twice?

Best Answer

From Definition 1 of Meinshausen (2007), there are two parameters controlling the solution of the relaxed Lasso.

The first one, $\lambda$, controls the variable selection, whereas the second, $\phi$, controls the level of shrinkage. When $\phi = 1$, the Lasso and the relaxed Lasso coincide (as you said!), but for $\phi < 1$ you obtain a solution whose coefficients are closer to those of an orthogonal projection onto the selected variables (a kind of soft de-biasing).
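
For reference, the relaxed Lasso estimator in Definition 1 of Meinshausen (2007) takes, up to notation, the form

$$\hat\beta^{\lambda,\phi} = \arg\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \Big(Y_i - X_i^\top \{\beta \cdot 1_{\mathcal{M}_\lambda}\}\Big)^2 + \phi\,\lambda\,\|\beta\|_1,$$

where $\mathcal{M}_\lambda$ is the set of variables selected by the Lasso with penalty $\lambda$, $1_{\mathcal{M}_\lambda}$ is its indicator vector (the product is componentwise), and $\phi \in (0, 1]$.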

This formulation actually corresponds to solving two problems:

  1. First, the full Lasso with penalization parameter $\lambda$.
  2. Second, the Lasso on $X_S$, which is $X$ reduced to the variables selected in step 1, with penalization parameter $\lambda\phi$ (see the sketch after this list).
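
A minimal numerical sketch of this two-step procedure, using scikit-learn's `Lasso` (its `alpha` argument plays the role of the penalization parameter, though its objective uses a $\tfrac{1}{2n}$ scaling); the `relaxed_lasso` helper and the synthetic data are illustrative, not from the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

def relaxed_lasso(X, y, lam, phi):
    """Two-step relaxed lasso sketch (illustrative helper, not Meinshausen's code).

    Step 1: full lasso with penalty lam selects the support S.
    Step 2: lasso on X[:, S] with the smaller penalty phi * lam
            relaxes the shrinkage on the selected coefficients.
    """
    # Step 1: variable selection with the full lasso.
    step1 = Lasso(alpha=lam).fit(X, y)
    S = np.flatnonzero(step1.coef_)
    if S.size == 0:
        return np.zeros(X.shape[1])
    # Step 2: refit on the selected columns only, with penalty phi * lam.
    step2 = Lasso(alpha=phi * lam).fit(X[:, S], y)
    beta = np.zeros(X.shape[1])
    beta[S] = step2.coef_
    return beta

# Quick demonstration on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.standard_normal(100)

b_lasso = relaxed_lasso(X, y, lam=0.1, phi=1.0)  # phi = 1: ordinary lasso
b_relax = relaxed_lasso(X, y, lam=0.1, phi=0.1)  # phi < 1: less shrinkage
print("lasso  :", np.round(b_lasso[:5], 2))
print("relaxed:", np.round(b_relax[:5], 2))
```

With $\phi = 1$ the two calls return the same fit; with $\phi < 1$ the selected coefficients are shrunk less, moving toward the least-squares fit on $X_S$.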