Solved – Why use Lasso estimates over OLS estimates on the Lasso-identified subset of variables

feature-selection, lasso, regression, regularization

For lasso regression $$L(\beta)=(X\beta-y)'(X\beta-y)+\lambda\|\beta\|_1,$$ suppose the best solution (minimum test error, for example) selects $k$ features, so that $\hat{\beta}^{lasso}=\left(\hat{\beta}_1^{lasso},\hat{\beta}_2^{lasso},\ldots,\hat{\beta}_k^{lasso},0,\ldots,0\right)$.

We know that $\left(\hat{\beta}_1^{lasso},\hat{\beta}_2^{lasso},\ldots,\hat{\beta}_k^{lasso}\right)$ is a biased estimate of $\left(\beta_1,\beta_2,\ldots,\beta_k\right)$, so why do we still take $\hat{\beta}^{lasso}$ as the final solution, instead of the more 'reasonable' $\hat{\beta}^{new}=\left(\hat{\beta}_{1:k}^{new},0,\ldots,0\right)$, where $\hat{\beta}_{1:k}^{new}$ is the OLS estimate from the partial model $L^{new}(\beta_{1:k})=(X_{1:k}\beta_{1:k}-y)'(X_{1:k}\beta_{1:k}-y)$? (Here $X_{1:k}$ denotes the columns of $X$ corresponding to the $k$ selected features.)

In brief, why do we use the lasso both for feature selection and for parameter estimation, instead of for variable selection only (leaving the estimation of the coefficients on the selected features to OLS)?

(Also, what does it mean that 'the lasso can select at most $n$ features', where $n$ is the sample size?)
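
To make the two estimators concrete, here is a minimal illustrative sketch of the contrast between $\hat{\beta}^{lasso}$ and the OLS refit $\hat{\beta}^{new}$, assuming scikit-learn and simulated data (the dataset, settings, and variable names are illustrative, not part of the question):

```python
# Illustrative sketch (assumed setup): fit the lasso with a cross-validated
# penalty, then refit OLS on the selected columns only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression

# Simulated data: 100 samples, 50 features, 5 of them truly informative.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# Step 1: lasso with the penalty chosen by cross-validation -> beta^lasso.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of the k selected features

# beta^lasso restricted to the selected features (shrunken estimates).
beta_lasso = lasso.coef_[selected]

# beta^new: ordinary least squares refit on the selected columns X_{1:k}.
beta_new = LinearRegression().fit(X[:, selected], y).coef_

print("selected features:", selected)
print("lasso estimates  :", np.round(beta_lasso, 2))
print("OLS refit        :", np.round(beta_new, 2))
```

The OLS refit typically gives coefficients of larger magnitude on the selected features, which is exactly the shrinkage bias the question is asking about.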

Best Answer

I don't believe there is anything wrong with using the lasso for variable selection and then using OLS. From "The Elements of Statistical Learning" (p. 91):

...the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero and in general they are not consistent [Added note: this means that, as the sample size grows, the coefficient estimates do not converge to the true coefficient values]. One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an un-restricted linear model to the selected set of features. This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors. Since the variables in the second step have less "competition" from noise variables, cross-validation will tend to pick a smaller value for $\lambda$ [the penalty parameter], and hence their coefficients will be shrunken less than those in the initial estimate.
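
As a concrete illustration of the relaxed lasso described above, here is a minimal sketch assuming scikit-learn (which has no built-in relaxed-lasso routine, so both passes simply use LassoCV; note that scikit-learn calls the penalty parameter alpha rather than $\lambda$). The data and settings are illustrative:

```python
# Sketch of the relaxed-lasso idea from the quote: cross-validate the penalty on
# all predictors, then cross-validate a second (typically smaller) penalty using
# only the predictors selected in the first pass.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# First pass: choose lambda_1 by cross-validation and find the non-zero coefficients.
first = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(first.coef_)

# Second pass: cross-validate again, but only over the selected predictors.
# With fewer noise variables "competing", the chosen lambda_2 tends to be
# smaller, so the surviving coefficients are shrunken less.
second = LassoCV(cv=5, random_state=0).fit(X[:, selected], y)

print("lambda_1 =", first.alpha_, " lambda_2 =", second.alpha_)
print("relaxed-lasso coefficients:", np.round(second.coef_, 2))
```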

Another reasonable approach, similar in spirit to the relaxed lasso, would be to use the lasso once (or several times in tandem) to identify a group of candidate predictor variables, and then use best-subsets regression to select the best predictors to keep (also see "The Elements of Statistical Learning" for this). For this to work you would need to refine the group of candidate predictors down to around 35, since exhaustive subset search quickly becomes computationally prohibitive beyond that, and this won't always be feasible. You can use cross-validation or AIC as the criterion to prevent over-fitting.
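
Here is a minimal sketch of that lasso-screening-plus-best-subsets idea, assuming scikit-learn for the lasso and statsmodels for the AIC; the candidate-pool cap, the simulated data, and the variable names are illustrative choices rather than anything prescribed above:

```python
# Sketch: lasso screening to a small candidate pool, then exhaustive
# best-subsets search over that pool, scored by AIC.
from itertools import combinations

import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# Step 1: lasso screening -> candidate predictors with non-zero coefficients.
coef = LassoCV(cv=5, random_state=0).fit(X, y).coef_
candidates = np.flatnonzero(coef)
# Keep the exhaustive search tractable: cap the pool (here at 10, by |coef|).
if len(candidates) > 10:
    candidates = candidates[np.argsort(np.abs(coef[candidates]))[-10:]]

# Step 2: best-subsets search over the candidates, choosing the subset
# with the lowest AIC (cross-validation could be used instead).
best_aic, best_subset = np.inf, None
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        fit = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
        if fit.aic < best_aic:
            best_aic, best_subset = fit.aic, subset

print("best subset by AIC:", best_subset, " AIC =", round(best_aic, 1))
```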
