For Lasso regression $$L(\beta)=(X\beta-y)'(X\beta-y)+\lambda\|\beta\|_1,$$ suppose the best solution (minimum test error, for example) selects $k$ features, so that $\hat{\beta}^{lasso}=\left(\hat{\beta}_1^{lasso},\hat{\beta}_2^{lasso},\dots,\hat{\beta}_k^{lasso},0,\dots,0\right)$.
We know that $\left(\hat{\beta}_1^{lasso},\hat{\beta}_2^{lasso},\dots,\hat{\beta}_k^{lasso}\right)$ is a biased estimate of $\left(\beta_1,\beta_2,\dots,\beta_k\right)$, so why do we still take $\hat{\beta}^{lasso}$ as the final solution, instead of the seemingly more 'reasonable' $\hat{\beta}^{new}=\left(\hat{\beta}_{1:k}^{new},0,\dots,0\right)$, where $\hat{\beta}_{1:k}^{new}$ is the LS estimate from the partial model $L^{new}(\beta_{1:k})=(X_{1:k}\beta_{1:k}-y)'(X_{1:k}\beta_{1:k}-y)$? (Here $X_{1:k}$ denotes the columns of $X$ corresponding to the $k$ selected features.)
In brief, why do we use the Lasso both for feature selection and for parameter estimation, instead of only for variable selection (leaving the estimation of the selected features' coefficients to OLS)?
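To make the comparison concrete, here is a minimal sketch on synthetic data of the two estimators I have in mind, using scikit-learn's `Lasso` and `LinearRegression` (the penalty value `alpha=0.1` is an arbitrary placeholder; note that scikit-learn scales the squared-error term by $1/(2n)$, so its `alpha` corresponds to $\lambda$ only up to that factor):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)])
y = X @ beta_true + rng.normal(size=n)

# beta_hat_lasso: shrunken estimates, many set exactly to zero by the L1 penalty
# (sklearn's Lasso divides the squared-error term by 2n, so alpha ~ lambda/(2n))
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of the k selected features

# beta_hat_new: ordinary least squares refit on the selected columns only
ols = LinearRegression().fit(X[:, selected], y)

print("selected features:", selected)
print("lasso coefficients:", np.round(lasso.coef_[selected], 3))
print("OLS refit coefficients:", np.round(ols.coef_, 3))
```

In my notation, `lasso.coef_` plays the role of $\hat{\beta}^{lasso}$, and the OLS refit coefficients padded with zeros would be $\hat{\beta}^{new}$.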
(Also, what does it mean that 'Lasso can select at most $n$ features'? $n$ is the sample size.)
Best Answer
I don't believe there is anything wrong with using LASSO for variable selection and then using OLS. From "Elements of Statistical Learning" (p. 91): the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero, and one approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients and then fit an unrestricted linear model to that selected set of features (not always feasible if the selected set is large). Alternatively, one can use the lasso to select the non-zero predictors and then apply the lasso again, restricted to the predictors selected in the first step; this is known as the relaxed lasso (Meinshausen, 2007).
Another reasonable approach, similar in spirit to the relaxed lasso, would be to use the lasso once (or several times in tandem) to identify a group of candidate predictor variables, and then use best-subsets regression to choose the predictors from that group (see also "Elements of Statistical Learning" for this). For this to work, you would need to refine the group of candidate predictors down to around 35, which won't always be feasible. You can use cross-validation or AIC as the criterion to prevent over-fitting.
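As a rough sketch of that workflow (synthetic data; the use of `LassoCV` for the screening step and a Gaussian-form AIC for scoring subsets are illustrative choices on my part, not prescriptions from the book): screen candidates with a cross-validated lasso, then enumerate all subsets of the candidates and keep the one with the lowest AIC. The enumeration is only feasible when the candidate group is small.

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(1)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = X[:, :4] @ np.array([2.0, -1.0, 1.5, 0.5]) + rng.normal(size=n)

# Step 1: lasso (lambda chosen by 5-fold cross-validation) screens candidate predictors
candidates = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)

def gaussian_aic(cols):
    """AIC = n*log(RSS/n) + 2*(#params), Gaussian form up to an additive constant."""
    model = LinearRegression().fit(X[:, cols], y)
    rss = np.sum((y - model.predict(X[:, cols])) ** 2)
    return n * np.log(rss / n) + 2 * (len(cols) + 1)  # +1 for the intercept

# Step 2: best-subsets over the (small) candidate set, scored by AIC
best = min(
    (subset for r in range(1, len(candidates) + 1)
            for subset in combinations(candidates, r)),
    key=lambda cols: gaussian_aic(list(cols)),
)
print("lasso candidates:", candidates)
print("best subset by AIC:", best)
```

Replacing `gaussian_aic` with a cross-validated error estimate gives the cross-validation variant of the same idea.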