Solved – Consistency of lasso

Tags: asymptotics, consistency, lasso, self-study

I would appreciate help in understanding the following theorem from the paper by Knight and Fu (2000):

Consider a linear regression model of the form
$$Y_i = \beta_0 + x_i'\beta + \varepsilon_i,$$
where the $\varepsilon_i$ are i.i.d. with mean $0$ and variance $\sigma^2$. We estimate the model using the lasso-type estimator $\hat\beta$,
$$\hat\beta = \text{arg min}_\phi \sum_{i=1}^{n}(Y_i - x_i'\phi)^2 + \lambda_n\sum_{j=1}^p\vert\phi_j\vert .$$

Further assume that $C_n = n^{-1}\sum_{i=1}^nx_ix_i' \to C$, where $C$ is nonsingular. Let $\lambda_n/n \to \lambda_0 \geq 0$. Then $\hat\beta \xrightarrow{p} \text{arg min}(Z)$ where
$$Z(\phi) = (\phi - \beta)'C(\phi - \beta) + \lambda_0\sum_{j=1}^p\vert \phi_j\vert. $$

Thus if $\lambda_n = o(n)$, then $\lambda_0 = 0$, so $\text{arg min}(Z) = \beta$ and $\hat\beta$ is consistent.
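
To check my understanding numerically, here is a minimal simulation sketch (my own assumed setup, not from the paper): with $\lambda_n = \sqrt{n} = o(n)$, the lasso estimate should approach $\beta$ as $n$ grows.

```python
# Minimal sketch (assumed setup): lasso with lambda_n = sqrt(n) = o(n)
# should converge to the true beta as n grows.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
beta = np.array([2.0, -1.0, 0.0, 0.5])  # hypothetical true coefficients

for n in [100, 1_000, 10_000, 100_000]:
    X = rng.normal(size=(n, beta.size))          # i.i.d. design, so C = I
    y = 1.0 + X @ beta + rng.normal(size=n)      # intercept beta_0 = 1, sigma^2 = 1
    lam = np.sqrt(n)                             # lambda_n = o(n)
    # sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
    # so alpha = lambda_n/(2n) matches the objective above.
    fit = Lasso(alpha=lam / (2 * n)).fit(X, y)
    print(n, np.round(fit.coef_, 3))
```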

Could you explain how consistency follows from the fact that $\hat\beta \xrightarrow{p} \text{arg min}(Z)$? Also, are we talking about consistency in parameter estimation, or some other type of consistency (e.g., the oracle property)?

Best Answer

You'll be disappointed to find that the consistency that matters most with the lasso is consistency in which predictors are chosen. If you simulate two moderately large datasets, run the lasso on each independently, and compare the results, the low degree of overlap in the selected features will reveal how difficult the selection task is. This is even more true when collinearities are present. The lasso spends too much of its energy on feature selection instead of estimation, and the L1 penalty shrinks truly important predictors too much (hence the popularity of the horseshoe prior in Bayesian high-dimensional modeling). I wouldn't be too interested in the type of consistency you described above until these more fundamental issues are addressed. I discuss these issues in general, and show how the bootstrap can help uncover them, here in the chapter on challenges of high-dimensional data analysis.
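
As a rough illustration of this instability (an assumed setup, not the answerer's code), the following sketch fits the lasso on two independent draws from the same correlated design and compares which predictors each run selects:

```python
# Sketch: lasso feature-selection overlap across two independent datasets.
import numpy as np
from sklearn.linear_model import LassoCV

p, n, rho = 50, 200, 0.7
beta = np.zeros(p)
beta[:5] = 1.0  # five truly active predictors
# AR(1)-style correlated design to mimic collinearity
cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

def selected_features(seed):
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + rng.normal(size=n)
    fit = LassoCV(cv=5).fit(X, y)        # penalty chosen by cross-validation
    return set(np.flatnonzero(fit.coef_))

s1, s2 = selected_features(10), selected_features(11)
print("run 1 selects:", sorted(s1))
print("run 2 selects:", sorted(s2))
print("Jaccard overlap:", len(s1 & s2) / len(s1 | s2))
```

Typically the two selected sets differ noticeably, even though the data-generating process is identical in both runs.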
