Solved – Relationship between ridge regression and PCA regression

Tags: pca, regression, regularization, ridge regression

I remember having read somewhere on the web a connection between ridge regression (with $\ell_2$ regularization) and PCA regression: while using $\ell_2$-regularized regression with hyperparameter $\lambda$, if $\lambda \to 0$, then the regression is equivalent to removing the PC variable with the smallest eigenvalue.

  • Why is this true?
  • Does this have anything to do with the optimization procedure? Naively, I would have expected it to be equivalent to OLS.
  • Does anybody have a reference for this?

Best Answer

Let $\mathbf X$ be the centered $n \times p$ predictor matrix and consider its singular value decomposition $\mathbf X = \mathbf{USV}^\top$ with $\mathbf S$ being a diagonal matrix with diagonal elements $s_i$.

The fitted values of ordinary least squares (OLS) regression are given by $$\hat {\mathbf y}_\mathrm{OLS} = \mathbf X \beta_\mathrm{OLS} = \mathbf X (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y = \mathbf U \mathbf U^\top \mathbf y.$$ The fitted values of ridge regression are given by $$\hat {\mathbf y}_\mathrm{ridge} = \mathbf X \beta_\mathrm{ridge} = \mathbf X (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} \mathbf X^\top \mathbf y = \mathbf U\: \mathrm{diag}\left\{\frac{s_i^2}{s_i^2+\lambda}\right\}\mathbf U^\top \mathbf y.$$ The fitted values of PCA regression (PCR) with $k$ components are given by $$\hat {\mathbf y}_\mathrm{PCR} = \mathbf X_\mathrm{PCA} \beta_\mathrm{PCR} = \mathbf U\: \mathrm{diag}\left\{1,\ldots, 1, 0, \ldots, 0\right\}\mathbf U^\top \mathbf y,$$ where there are $k$ ones followed by zeros.
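As a quick sanity check, here is a minimal NumPy sketch (with synthetic data and arbitrary choices of $n$, $p$, $\lambda$, and $k$, all mine for illustration) that verifies the three fitted-value expressions above numerically.

```python
# Verify the OLS, ridge, and PCR fitted-value formulas via the SVD of centered X.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, k = 50, 5, 2.0, 3             # illustrative sizes and hyperparameters

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # center the predictors
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# OLS: X (X'X)^{-1} X' y  ==  U U' y
yhat_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(yhat_ols, U @ (U.T @ y))

# Ridge: X (X'X + lam*I)^{-1} X' y  ==  U diag{s_i^2 / (s_i^2 + lam)} U' y
yhat_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(yhat_ridge, U @ (s**2 / (s**2 + lam) * (U.T @ y)))

# PCR with k components: regressing y on the first k PC scores gives U_k U_k' y
Z = X @ Vt[:k].T                         # scores on the first k principal components
yhat_pcr = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)
assert np.allclose(yhat_pcr, U[:, :k] @ (U[:, :k].T @ y))
```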

From here we can see that:

  1. If $\lambda=0$ then $\hat {\mathbf y}_\mathrm{ridge} = \hat {\mathbf y}_\mathrm{OLS}$.

  2. If $\lambda>0$, then the larger the singular value $s_i$, the less its component is shrunk by ridge regression. Components with small singular values ($s_i^2 \approx \lambda$ and smaller) are shrunk the most.

  3. In contrast, in PCA regression, large singular values are kept intact, and the small ones (beyond the first $k$) are removed completely. This corresponds to $\lambda=0$ for the first $k$ components and $\lambda=\infty$ for the rest.

  4. This means that ridge regression can be seen as a "smooth version" of PCR (see the numerical sketch after this list).

    (This intuition is useful but does not always hold; e.g. if all $s_i$ are approximately equal, then ridge regression can only shrink all principal components of $\mathbf X$ approximately equally, and its result can differ strongly from PCR.)

  5. Ridge regression tends to perform better in practice (e.g. to have higher cross-validated performance).

  6. Now, to answer your question specifically: if $\lambda \to 0$, then $\hat {\mathbf y}_\mathrm{ridge} \to \hat {\mathbf y}_\mathrm{OLS}$. I don't see how this could correspond to removing the component with the smallest $s_i$; I think that statement is wrong.
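Here is a small sketch (with made-up singular values, purely for illustration) of points 1–4 and 6: the ridge shrinkage factors $s_i^2/(s_i^2+\lambda)$ compared with PCR's hard 0/1 factors, and the limit $\lambda \to 0$ recovering the OLS factors (all ones).

```python
# Compare ridge shrinkage factors with PCR's all-or-nothing factors.
import numpy as np

s = np.array([10.0, 5.0, 2.0, 0.5, 0.1])   # singular values of X (made up)
k = 3                                       # number of PCR components

for lam in [0.0, 0.1, 1.0, 10.0]:
    ridge_factors = s**2 / (s**2 + lam)
    print(f"lambda = {lam:5.1f}   ridge factors: {np.round(ridge_factors, 3)}")

pcr_factors = (np.arange(len(s)) < k).astype(float)
print(f"PCR with k = {k}     hard factors: {pcr_factors}")

# lambda = 0 gives all factors equal to 1 (OLS, point 1). As lambda grows,
# directions with small s_i are shrunk towards 0 much faster than those with
# large s_i (points 2-4), the smooth analogue of PCR's hard cutoff at k.
```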

One good reference is The Elements of Statistical Learning, Section 3.4.1 "Ridge regression".


See also this thread: Interpretation of ridge regularization in regression and in particular the answer by @BrianBorchers.