Solved – Under exactly what conditions is ridge regression able to provide an improvement over ordinary least squares regression

regression, regularization, ridge regression

Ridge regression estimates the parameters $\boldsymbol \beta$ in a linear model $\mathbf y = \mathbf X \boldsymbol \beta + \boldsymbol \epsilon$ by $$\hat{\boldsymbol \beta}_\lambda = (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} \mathbf X^\top \mathbf y,$$ where $\lambda \ge 0$ is a regularization parameter. It is well known that ridge regression often performs better than OLS (the $\lambda=0$ case) when there are many correlated predictors.
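
For concreteness, here is a minimal NumPy sketch of this closed-form estimator; the data and the penalty value `lam = 1.0` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))            # design matrix
beta_true = rng.standard_normal(p)         # arbitrary "true" coefficients
y = X @ beta_true + rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = ridge(X, y, 0.0)    # lam = 0 recovers OLS
beta_ridge = ridge(X, y, 1.0)  # coefficients shrunk toward zero
```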

An existence theorem for ridge regression says that there always exists a parameter $\lambda^* > 0$ such that the mean squared error of $\hat{\boldsymbol \beta}_{\lambda^*}$ is strictly smaller than the mean squared error of the OLS estimate $\hat{\boldsymbol \beta}_\mathrm{OLS}=\hat{\boldsymbol \beta}_0$. In other words, an optimal value of $\lambda$ is always strictly positive. This was apparently first proven by Hoerl and Kennard (1970; reprinted as Hoerl, Arthur E., and Robert W. Kennard, “Ridge Regression: Biased Estimation for Nonorthogonal Problems,” Technometrics, vol. 42, no. 1, 2000, pp. 80–86, https://doi.org/10.2307/1271436). It is repeated in many lecture notes that I found online (e.g. here and here). My question is about the assumptions of this theorem:

  1. Are there any assumptions about the covariance matrix $\mathbf X^\top \mathbf X$?

  2. Are there any assumptions about dimensionality of $\mathbf X$?

In particular, is the theorem still true if predictors are orthogonal (i.e. $\mathbf X^\top \mathbf X$ is diagonal), or even if $\mathbf X^\top \mathbf X=\mathbf I$? And is it still true if there is only one or two predictors (say, one predictor and an intercept)?

If the theorem makes no such assumptions and remains true even in these cases, then why is ridge regression usually recommended only in the case of correlated predictors, and never (?) recommended for simple (i.e. not multiple) regression?


This is related to my question Unified view on shrinkage: what is the relation (if any) between Stein's paradox, ridge regression, and random effects in mixed models?, but so far none of the answers there clarifies this point.

Best Answer

The answer to both 1 and 2 is no, but care is needed in interpreting the existence theorem.

Variance of Ridge Estimator

Let $\hat{\beta^*}$ be the ridge estimate with penalty $k$ (the question's $\lambda$), and let $\beta$ be the true parameter in the model $Y = X \beta + \epsilon$. Let $\lambda_1, \dotsc, \lambda_p$ be the eigenvalues of $X^T X$ (not the penalty).
From Hoerl & Kennard's equations 4.2–4.5, the risk (the expected squared $L^2$ norm of the estimation error) is

$$ \begin{align*} E \left( \left[ \hat{\beta^*} - \beta \right]^T \left[ \hat{\beta^*} - \beta \right] \right) & = \sigma^2 \sum_{j=1}^p \frac{\lambda_j}{\left( \lambda_j +k \right)^2} + k^2 \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta \\ & = \gamma_1 (k) + \gamma_2(k) \\ & = R(k), \end{align*} $$ where, as far as I can tell, $\left( X^T X + k \mathbf{I}_p \right)^{-2} = \left( X^T X + k \mathbf{I}_p \right)^{-1} \left( X^T X + k \mathbf{I}_p \right)^{-1}.$ They remark that $\gamma_1(k)$ is the total variance of $\hat{\beta^*}$ (the sum of the variances of its components), while $\gamma_2(k)$ is the squared length of the bias vector.
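
As a sanity check (mine, not from the paper), the decomposition can be verified by simulation for a small arbitrary design: the formula and a Monte Carlo estimate of $E \left( \left[ \hat{\beta^*} - \beta \right]^T \left[ \hat{\beta^*} - \beta \right] \right)$ should agree up to simulation noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, k = 40, 3, 1.0, 0.5            # arbitrary choices
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])           # arbitrary "true" beta
G = X.T @ X
eig = np.linalg.eigvalsh(G)

# H&K risk formula: variance term gamma_1 + squared-bias term gamma_2
A_inv = np.linalg.inv(G + k * np.eye(p))
gamma1 = sigma**2 * np.sum(eig / (eig + k) ** 2)
gamma2 = k**2 * beta @ A_inv @ A_inv @ beta
risk_formula = gamma1 + gamma2

# Monte Carlo estimate of E ||beta_hat - beta||^2 over fresh noise draws
errs = []
for _ in range(20000):
    y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = A_inv @ (X.T @ y)            # ridge estimate with penalty k
    errs.append(np.sum((beta_hat - beta) ** 2))

print(risk_formula, np.mean(errs))          # these should be close
```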

Supposing $X^T X = \mathbf{I}_p$, the risk reduces to $$R(k) = \frac{p \sigma^2 + k^2 \beta^T \beta}{(1+k)^2}.$$ Its derivative with respect to $k$ is $$R^\prime (k) = 2\,\frac{k(1+k)\beta^T \beta - (p\sigma^2 + k^2 \beta^T \beta)}{(1+k)^3} = \frac{2 \left( k\, \beta^T \beta - p\sigma^2 \right)}{(1+k)^3}.$$ Since $\lim_{k \rightarrow 0^+} R^\prime (k) = -2p \sigma^2 < 0$, the risk is initially decreasing in $k$, so there is some $k^*>0$ such that $R(k^*)<R(0)$; in fact $R$ is minimized at $k^* = p\sigma^2 / \beta^T \beta$.
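
A quick numerical illustration (my own, with made-up values for $p$, $\sigma^2$, and $\beta^T \beta$): plugging a grid of penalties into $R(k)$ and comparing with the OLS risk $R(0) = p\sigma^2$ shows the risk dipping below the OLS value for moderate $k$, even with a single predictor.

```python
import numpy as np

def risk(k, p, sigma2, b):
    """R(k) under an orthonormal design, with b = beta^T beta."""
    return (p * sigma2 + k**2 * b) / (1.0 + k) ** 2

p, sigma2, b = 1, 1.0, 4.0            # a single predictor; made-up values
ks = np.linspace(0.0, 2.0, 201)
R = risk(ks, p, sigma2, b)

print(risk(0.0, p, sigma2, b))        # OLS risk: p * sigma^2 = 1.0
print(R.min(), ks[R.argmin()])        # ~0.8, attained at k* = p*sigma2/b = 0.25
```

Even in this one-predictor, orthonormal case the minimum risk is attained at a strictly positive penalty, which is exactly what the existence theorem claims.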

The authors remark that an orthonormal design is the best you can hope for in terms of the risk at $k=0$, and that as $X^T X$ becomes more ill-conditioned (its smallest eigenvalue approaches zero), $\lim_{k \rightarrow 0^+} R^\prime (k) = -2\sigma^2 \sum_{j=1}^p \lambda_j^{-2}$ diverges to $- \infty$.

Comment

There appears to be a paradox here: if $p=1$ and $X$ is a constant column (say all ones), then we are just estimating the mean of a sequence of Normal$(\beta, \sigma^2)$ variables, and we know that the usual unbiased estimator (the sample mean) is admissible in this case. This is resolved by noticing that the above reasoning only shows that, for each fixed $\beta^T \beta$, a risk-reducing value of $k$ exists. But for any fixed $k>0$, we can make the risk of the ridge estimate exceed that of OLS (indeed grow without bound) by making $\beta^T \beta$ large, so the argument does not produce a single $k$ whose ridge estimator dominates OLS uniformly in $\beta$, and it does not contradict the admissibility of the unbiased estimate.
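
To make that last point concrete using the orthonormal-design risk above: for any fixed $k > 0$,

$$R(k) = \frac{p\sigma^2 + k^2\, \beta^T \beta}{(1+k)^2} > p\sigma^2 = R(0) \quad \text{whenever} \quad \beta^T \beta > p\sigma^2\,\frac{2+k}{k},$$

so no single positive penalty dominates OLS uniformly in $\beta$.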

Why is ridge regression usually recommended only in the case of correlated predictors?

H&K's risk derivation shows that if we think that $\beta ^T \beta$ is small, and if the design $X^T X$ is nearly-singular, then we can achieve large reductions in the risk of the estimate. I think ridge regression isn't used ubiquitously because the OLS estimate is a safe default, and that the invariance and unbiasedness properties are attractive. When it fails, it fails honestly--your covariance matrix explodes. There is also perhaps a philosophical/inferential point, that if your design is nearly singular, and you have observational data, then the interpretation of $\beta$ as giving changes in $E Y$ for unit changes in $X$ is suspect--the large covariance matrix is a symptom of that.

But if your goal is solely prediction, the inferential concerns no longer hold, and you have a strong argument for using some sort of shrinkage estimator.
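
As an illustrative sketch of that last point (mine, with an entirely made-up correlated design, coefficients, and an untuned penalty `lam = 1.0`): with strongly correlated predictors and a modest sample size, ridge tends to have lower out-of-sample prediction error than OLS.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, rho, lam = 40, 10, 0.95, 1.0                 # small sample, highly correlated design
Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
L = np.linalg.cholesky(Sigma)                      # for drawing correlated predictors
beta = rng.standard_normal(p)                      # arbitrary "true" coefficients

def fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols_mse, ridge_mse = [], []
for _ in range(500):
    X = rng.standard_normal((n, p)) @ L.T          # training data
    y = X @ beta + rng.standard_normal(n)
    Xt = rng.standard_normal((2000, p)) @ L.T      # fresh test data
    yt = Xt @ beta + rng.standard_normal(2000)
    ols_mse.append(np.mean((yt - Xt @ fit(X, y, 0.0)) ** 2))
    ridge_mse.append(np.mean((yt - Xt @ fit(X, y, lam)) ** 2))

print(np.mean(ols_mse), np.mean(ridge_mse))        # ridge is typically lower here
```

In practice one would pick the penalty by cross-validation rather than fixing it, but even an untuned penalty is enough to illustrate the prediction argument.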