Solved – Relationship between ridge regression and PCA regression

Tags: pca, regression, regularization, ridge regression

I remember having read somewhere on the web a connection between ridge regression (with $\ell_2$ regularization) and PCA regression: while using $\ell_2$-regularized regression with hyperparameter $\lambda$, if $\lambda \to 0$, then the regression is equivalent to removing the PC variable with the smallest eigenvalue.

  • Why is this true?
  • Does this have anything to do with the optimization procedure? Naively, I would have expected it to be equivalent to OLS.
  • Does anybody have a reference for this?

Best Answer

Let $\mathbf X$ be the centered $n \times p$ predictor matrix and consider its singular value decomposition $\mathbf X = \mathbf{USV}^\top$ with $\mathbf S$ being a diagonal matrix with diagonal elements $s_i$.

The fitted values of ordinary least squares (OLS) regression are given by $$\hat {\mathbf y}_\mathrm{OLS} = \mathbf X \beta_\mathrm{OLS} = \mathbf X (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y = \mathbf U \mathbf U^\top \mathbf y.$$ The fitted values of ridge regression are given by $$\hat {\mathbf y}_\mathrm{ridge} = \mathbf X \beta_\mathrm{ridge} = \mathbf X (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} \mathbf X^\top \mathbf y = \mathbf U\: \mathrm{diag}\left\{\frac{s_i^2}{s_i^2+\lambda}\right\}\mathbf U^\top \mathbf y.$$ The fitted values of PCA regression (PCR) with $k$ components are given by $$\hat {\mathbf y}_\mathrm{PCR} = \mathbf X_\mathrm{PCA} \beta_\mathrm{PCR} = \mathbf U\: \mathrm{diag}\left\{1,\ldots, 1, 0, \ldots, 0\right\}\mathbf U^\top \mathbf y,$$ where there are $k$ ones followed by zeros.
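As a quick sanity check, here is a minimal NumPy sketch (with synthetic data and arbitrary choices of $n$, $p$, $\lambda$, and $k$, all mine for illustration) that verifies the three fitted-value expressions above numerically.

```python
# Verify the OLS, ridge, and PCR fitted-value formulas via the SVD of centered X.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, k = 50, 5, 2.0, 3             # illustrative sizes and hyperparameters

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # center the predictors
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# OLS: X (X'X)^{-1} X' y  ==  U U' y
yhat_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(yhat_ols, U @ (U.T @ y))

# Ridge: X (X'X + lam*I)^{-1} X' y  ==  U diag{s_i^2 / (s_i^2 + lam)} U' y
yhat_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(yhat_ridge, U @ (s**2 / (s**2 + lam) * (U.T @ y)))

# PCR with k components: regressing y on the first k PC scores gives U_k U_k' y
Z = X @ Vt[:k].T                         # scores on the first k principal components
yhat_pcr = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)
assert np.allclose(yhat_pcr, U[:, :k] @ (U[:, :k].T @ y))
```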

From here we can see that:

  1. If $\lambda=0$ then $\hat {\mathbf y}_\mathrm{ridge} = \hat {\mathbf y}_\mathrm{OLS}$.

  2. If $\lambda>0$, then the larger the singular value $s_i$, the less its component is shrunk by ridge regression. Components with small singular values ($s_i^2 \approx \lambda$ and smaller) are shrunk the most.

  3. In contrast, in PCA regression, large singular values are kept intact, and the small ones (beyond the first $k$) are removed completely. This corresponds to $\lambda=0$ for the first $k$ components and $\lambda=\infty$ for the rest.

  4. This means that ridge regression can be seen as a "smooth version" of PCR (see the numerical sketch after this list).

    (This intuition is useful but does not always hold; e.g. if all $s_i$ are approximately equal, then ridge regression can only shrink all principal components of $\mathbf X$ approximately equally, and its result can differ strongly from PCR.)

  5. Ridge regression tends to perform better in practice (e.g. to have higher cross-validated performance).

  6. Now, to answer your question specifically: if $\lambda \to 0$, then $\hat {\mathbf y}_\mathrm{ridge} \to \hat {\mathbf y}_\mathrm{OLS}$. I don't see how this could correspond to removing the component with the smallest $s_i$; I think that statement is wrong.
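Here is a small sketch (with made-up singular values, purely for illustration) of points 1–4 and 6: the ridge shrinkage factors $s_i^2/(s_i^2+\lambda)$ compared with PCR's hard 0/1 factors, and the limit $\lambda \to 0$ recovering the OLS factors (all ones).

```python
# Compare ridge shrinkage factors with PCR's all-or-nothing factors.
import numpy as np

s = np.array([10.0, 5.0, 2.0, 0.5, 0.1])   # singular values of X (made up)
k = 3                                       # number of PCR components

for lam in [0.0, 0.1, 1.0, 10.0]:
    ridge_factors = s**2 / (s**2 + lam)
    print(f"lambda = {lam:5.1f}   ridge factors: {np.round(ridge_factors, 3)}")

pcr_factors = (np.arange(len(s)) < k).astype(float)
print(f"PCR with k = {k}     hard factors: {pcr_factors}")

# lambda = 0 gives all factors equal to 1 (OLS, point 1). As lambda grows,
# directions with small s_i are shrunk towards 0 much faster than those with
# large s_i (points 2-4), the smooth analogue of PCR's hard cutoff at k.
```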

One good reference is The Elements of Statistical Learning, Section 3.4.1 "Ridge regression".


See also this thread: Interpretation of ridge regularization in regression and in particular the answer by @BrianBorchers.