Solved – Lasso and Ridge tuning parameter scope

cross-validation, lasso, regression, regularization, ridge-regression

In ridge and lasso linear regression, an important step is choosing the tuning parameter lambda. I usually do a grid search on a log scale from -6 to 4, and it works well for ridge. For the lasso, should I take the order of magnitude of the output y into account? For example, if y is on the nano scale (1e-9), should my search range for log lambda be -15 to -5?

All the input features are normalized; they lie within [-3, 3].

Best Answer

Yes, you should take into account the scale of the output $y$, and you should also take into account the scale of the covariates in $X$.

Let $X \in \mathbb{R}^{n \times p}$ be the design matrix, whose rows are covariate vectors that together seek to explain the response $y \in \mathbb{R}^n$. Each entry of the response, $y_i = f(e_i^T X) + \epsilon_i$ (for $i = 1, \dots, n$), is additively composed of a signal that depends on the covariates and iid mean-zero noise. Modeling the signal $f$ as approximately linear leads to the LASSO estimate $$\hat \beta_\lambda = \arg\min_\beta \frac{1}{2n} \|y-X\beta\|_2^2 + \lambda \|\beta\|_1.$$ By the first-order (KKT) conditions, $\frac{-1}{n} X^T (y - X \hat \beta_\lambda) = \lambda \hat{z}_\lambda$, where $\hat{z}_\lambda$ is the dual variable satisfying $\hat{z}_{\lambda,j} = \operatorname{sgn}(\hat{\beta}_{\lambda, j})$ if $\hat{\beta}_{\lambda, j} \neq 0$ and $\hat{z}_{\lambda, j} \in [-1,1]$ if $\hat{\beta}_{\lambda, j} = 0$.

Plugging $\hat{\beta}_\lambda = 0$ into this equation, we see that $\frac{-1}{n} X^T y = \lambda \hat{z}_\lambda$, so that $$\frac{1}{n} \|X^T y \|_\infty = \lambda \|\hat{z}_{\lambda}\|_\infty.$$

If $\|\hat{z}_\lambda\|_\infty < 1$, then $\lambda$ could decrease (with $\|\hat{z}_\lambda\|_\infty$ increasing to maintain equality) and the LASSO estimate would still be $\hat{\beta}_\lambda = 0$. Therefore, at $\lambda_\mathrm{max}$, the smallest value of $\lambda$ that produces $\hat{\beta}_{\lambda}=0$, we get $$\frac{1}{n} \|X^T y\|_\infty = \lambda_\mathrm{max} \cdot 1.$$
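
Here's a minimal numerical check of this, assuming scikit-learn's `Lasso` (whose objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$ matches the one above when no intercept is fit); the data here are synthetic, just to illustrate:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

# Smallest lambda for which the LASSO solution is identically zero
lam_max = np.max(np.abs(X.T @ y)) / n

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1
at_max = Lasso(alpha=lam_max, fit_intercept=False).fit(X, y)
below = Lasso(alpha=0.99 * lam_max, fit_intercept=False).fit(X, y)

print(np.count_nonzero(at_max.coef_))  # 0 (up to solver tolerance)
print(np.count_nonzero(below.coef_))   # >= 1: a coefficient activates
```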

This tells us that there's no need to consider $\lambda > \lambda_\mathrm{max}$ when tuning the LASSO, and that $\lambda_\mathrm{max}$ scales linearly with $y$: multiply $y$ by $10^{-9}$ and the useful range of $\lambda$ shrinks by the same factor. In practice, most solvers standardize the columns of $X$, so the scale of the covariates won't need to be handled directly. (Note that it's reasonable to standardize the covariates, since the units of measurement shouldn't affect the estimated coefficients.)
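
So rather than hand-picking a log-$\lambda$ range, you can derive the whole grid from $\lambda_\mathrm{max}$; glmnet-style solvers build their default path this way. A sketch (the helper name `lasso_lambda_grid` and the `eps` ratio are illustrative choices, not a standard API):

```python
import numpy as np

def lasso_lambda_grid(X, y, n_points=100, eps=1e-4):
    """Log-spaced grid from lambda_max down to eps * lambda_max."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / n
    return np.geomspace(lam_max, eps * lam_max, n_points)

# The grid inherits the scale of y: rescaling y rescales every lambda
# by the same factor, e.g.
#   lasso_lambda_grid(X, 1e-9 * y) == 1e-9 * lasso_lambda_grid(X, y)
# so no hand-tuning of the search range is needed for nano-scale outputs.
```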

The ridge case is discussed well here: Maximum penalty for ridge regression