Solved – Maximum penalty for ridge regression

cross-validation, regression, regularization, ridge regression

Consider a regression model

$$ y = X \beta + \varepsilon. $$

I will use ridge regression to estimate $\beta$. Ridge regression contains a tuning parameter (the penalty intensity) $\lambda$. If I were given a grid of candidate $\lambda$ values, I would use cross-validation to select the optimal $\lambda$. However, the grid is not given, so I need to design it first; for that I need to choose, among other things, a maximum value $\lambda_{max}$.

Question: How do I sensibly choose $\lambda_{max}$ in ridge regression?

There needs to be a balance between

  • a $\lambda_{max}$ that is "too large", leading to wasteful computations when evaluating the performance of (possibly many) models that are penalized too harshly;
  • a $\lambda_{max}$ that is "too small", leading to a forgone opportunity to penalize more intensely and obtain better performance.

(Note that the answer is simple in the case of LASSO; there you take $\lambda_{max}$ such that all coefficients are set exactly to zero for any $\lambda \geq \lambda_{max}$.)
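As a minimal sketch of that LASSO boundary, assuming the objective $\frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$ with centered $y$ and no intercept, the smallest penalty that zeroes out every coefficient is $\lambda_{max} = \lVert X^\top y\rVert_\infty / n$ (the helper name below is purely illustrative):

```python
import numpy as np

def lasso_lambda_max(X, y):
    """Smallest lambda that sets every lasso coefficient to zero,
    assuming the objective (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1
    with centered y and no intercept."""
    n = X.shape[0]
    return np.max(np.abs(X.T @ y)) / n
```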

Best Answer

The effect of $\lambda$ in the ridge regression estimator is that it "inflates" the singular values $s_i$ of $X$: each factor $1/s_i$ in the OLS solution is replaced by $s_i/(s^2_i+\lambda)$, i.e. the divisor grows from $s_i$ to $(s^2_i+\lambda)/s_i$. Specifically, if the SVD of the design matrix is $X=USV^\top$, then $$\hat\beta_\mathrm{ridge} = V \frac{S}{S^2+\lambda I} U^\top y.$$ This is explained multiple times on our website; see, e.g., @whuber's detailed exposition here: The proof of shrinking coefficients using ridge regression through "spectral decomposition".
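A minimal numerical check of this spectral form, using a random design purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))      # illustrative random design
y = rng.standard_normal(n)
lam = 2.0

# Spectral form: beta_ridge = V diag(s_i / (s_i^2 + lambda)) U^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# Closed form: (X^T X + lambda I)^{-1} X^T y
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

assert np.allclose(beta_svd, beta_direct)
```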

This suggests that selecting $\lambda$ much larger than $s_\mathrm{max}^2$ will shrink everything very strongly. I suspect that $$\lambda_{max}=\|X\|_F^2=\sum_i s_i^2$$ will be more than large enough for all practical purposes.

I usually normalize my lambdas by the squared Frobenius norm $\|X\|_F^2$ and use a cross-validation grid of normalized values running from (nearly) $0$ up to $1$, spaced on a log scale.
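A sketch of that grid construction (the grid length and lower endpoint below are arbitrary illustrative choices, not a recommendation):

```python
import numpy as np

def ridge_lambda_grid(X, n_points=30, min_ratio=1e-6):
    """Log-spaced ridge penalties whose largest value is
    ||X||_F^2 = sum of squared singular values of X.
    n_points and min_ratio are arbitrary illustrative defaults."""
    lam_max = np.linalg.norm(X, "fro") ** 2
    return lam_max * np.logspace(np.log10(min_ratio), 0.0, n_points)
```

Such a grid can then be handed to a standard cross-validation routine, e.g. scikit-learn's RidgeCV via its alphas argument.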


Having said that, no value of $\lambda$ can be seen as a true "maximum", in contrast to the lasso case. Imagine that the predictors are exactly orthogonal to the response, i.e. that the true $\beta=0$. Then for any finite $\lambda$ and any finite sample size $n$ we get $\hat \beta \ne 0$, and the estimate would therefore benefit from even stronger shrinkage.
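A small simulation of this point, with the response generated independently of the predictors so that the true $\beta = 0$ (the penalty values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)            # independent of X, so the true beta is 0

for lam in [1e0, 1e2, 1e4, 1e6]:      # arbitrary increasing penalties
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    # estimation error ||beta_hat - 0|| keeps shrinking as lambda grows
    print(f"lambda = {lam:10.0f}   ||beta_hat|| = {np.linalg.norm(beta_hat):.2e}")
```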