Solved – Is ridge regression useless in high dimensions ($n \ll p$)? How can OLS fail to overfit?

cross-validation, overfitting, regularization, ridge regression

Consider a good old regression problem with $p$ predictors and sample size $n$. The usual wisdom is that the OLS estimator will overfit and will generally be outperformed by the ridge regression estimator: $$\hat\beta = (X^\top X + \lambda I)^{-1}X^\top y.$$ It is standard to use cross-validation to find an optimal regularization parameter $\lambda$; here I use 10-fold CV. Clarification update: when $n<p$, by "OLS estimator" I mean the "minimum-norm OLS estimator" given by $$\hat\beta_\text{OLS} = (X^\top X)^+X^\top y = X^+ y.$$
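In numpy terms, the two estimators being compared are, schematically (the function names are just for illustration):

import numpy as np

def ridge(X, y, lam):
    # ridge estimator: (X^T X + lambda * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def min_norm_ols(X, y):
    # minimum-norm OLS estimator: X^+ y, the least-squares solution of smallest norm
    return np.linalg.pinv(X) @ y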

I have a dataset with $n=80$ and $p>1000$. All predictors are standardized, and there are quite a few that (alone) can do a good job of predicting $y$. If I randomly select a small-ish number of predictors, say $p=50<n$, I get a reasonable CV curve: large values of $\lambda$ yield zero R-squared, small values of $\lambda$ yield negative R-squared (because of overfitting), and there is some maximum in between. For $p=100>n$ the curve looks similar. However, for $p$ much larger than that, e.g. $p=1000$, I do not get any maximum at all: the curve plateaus, meaning that OLS with $\lambda\to 0$ performs as well as ridge regression with optimal $\lambda$.

[figure: cross-validation curves]

How is this possible, and what does it say about my dataset? Am I missing something obvious, or is it indeed counter-intuitive? How can there be any qualitative difference between $p=100$ and $p=1000$, given that both are larger than $n$?

Under what conditions does the minimum-norm OLS solution for $n<p$ not overfit?


Update: There was some disbelief in the comments, so here is a reproducible example using glmnet. I use Python but R users will easily adapt the code.

%matplotlib notebook

import numpy as np
import pylab as plt
import seaborn as sns; sns.set()

import glmnet_python    # from https://web.stanford.edu/~hastie/glmnet_python/
from cvglmnet import cvglmnet; from cvglmnetPlot import cvglmnetPlot

# 80x1112 data table; first column is y, rest is X. All variables are standardized
mydata = np.loadtxt('../q328630.txt')   # file is here https://pastebin.com/raw/p1cCCYBR
y = mydata[:,:1]
X = mydata[:,1:]

# select p here (try 1000 and 100)
p = 1000

# randomly selecting p variables out of 1111
np.random.seed(42)
X = X[:, np.random.permutation(X.shape[1])[:p]]

fit = cvglmnet(x = X.copy(), y = y.copy(), alpha = 0, standardize = False, intr = False, 
               lambdau=np.array([.0001, .001, .01, .1, 1, 10, 100, 1000, 10000, 100000]))
cvglmnetPlot(fit)
plt.gcf().set_size_inches(6,3)
plt.tight_layout()

[figures: cvglmnetPlot output]
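As an additional check in the same Python session, one can verify directly in numpy that the minimum-norm OLS solution coincides with the ridge solution as $\lambda\to 0$, using the identity $(X^\top X+\lambda I)^{-1}X^\top = X^\top(XX^\top+\lambda I)^{-1}$:

# minimum-norm OLS via the pseudoinverse
beta_ols = np.linalg.pinv(X) @ y

# ridge with a tiny lambda, computed via the n x n dual system for stability
lam = 1e-8
beta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), y)

print(np.max(np.abs(beta_ols - beta_ridge)))   # essentially zero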

Best Answer

A natural regularization happens because of the presence of many small components in the theoretical PCA of $x$. These small components are implicitly used to fit the noise with small coefficients. When using minimum norm OLS, you fit the noise with many small independent components, and this has a regularizing effect equivalent to Ridge regularization. This regularization is often too strong, and it is possible to compensate for it using "anti-regularization", known as negative Ridge. In that case, you will see the minimum of the MSE curve appear at negative values of $\lambda$.

By theoretical PCA, I mean:

Let $x\sim N(0,\Sigma)$ be a multivariate normal random vector. There is a linear isometry $f$ such that $u=f(x)\sim N(0,D)$ where $D$ is diagonal: the components of $u$ are independent. $D$ is simply obtained by diagonalizing $\Sigma$.

Now the model $y=\beta\cdot x+\epsilon$ can be written $y=f(\beta)\cdot f(x)+\epsilon$ (a linear isometry preserves the dot product). Writing $\gamma=f(\beta)$, the model becomes $y=\gamma\cdot u+\epsilon$. Furthermore $\|\beta\|=\|\gamma\|$, hence fitting methods like Ridge or minimum norm OLS are perfectly isomorphic: the estimator for $y=\gamma\cdot u+\epsilon$ is the image under $f$ of the estimator for $y=\beta\cdot x+\epsilon$.

Theoretical PCA transforms non-independent predictors into independent predictors. It is only loosely related to empirical PCA, where you use the empirical covariance matrix (which differs a lot from the theoretical one for small sample sizes). Theoretical PCA is not practically computable; it is only used here to interpret the model in an orthogonal predictor space.
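This equivariance is easy to check numerically; here is a small simulation of mine, with arbitrary toy dimensions, where Ridge fitted on $x$ and on $u=f(x)$ gives estimates that are images of each other under $f$:

import numpy as np

rng = np.random.RandomState(0)
n, p, lam = 50, 20, 1.0

# a random covariance matrix and its diagonalization Sigma = V D V^T
A = rng.randn(p, p)
Sigma = A @ A.T / p
D, V = np.linalg.eigh(Sigma)

# x ~ N(0, Sigma); the isometry f is x -> V^T x, so u = f(x) ~ N(0, D)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
U = X @ V
y = X @ rng.randn(p) + 0.1 * rng.randn(n)

# ridge fit in x-space and in u-space
ridge = lambda Z: np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
beta_hat, gamma_hat = ridge(X), ridge(U)

# the estimator in u-space is the image under f of the estimator in x-space
print(np.allclose(gamma_hat, V.T @ beta_hat))   # True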

Let's see what happens when we append many small-variance independent predictors to a model:

Theorem

Ridge regularization with coefficient $\lambda$ is equivalent (when $p\rightarrow\infty$) to:

  • adding $p$ fake independent predictors (centred and identically distributed) each with variance $\frac{\lambda}{p}$
  • fitting the enriched model with minimum norm OLS estimator
  • keeping only the parameters for the true predictors

(sketch of) Proof

We are going to prove that the cost functions are asymptotically equal. Let's split the model into real and fake predictors: $y=\beta x+\beta'x'+\epsilon$. The cost function of Ridge (for the true predictors) can be written:

$$\mathrm{cost}_\lambda=\|\beta\|^2+\frac{1}{\lambda}\|y-X\beta\|^2$$

When using minimum norm OLS, the response is fitted perfectly: the error term is 0. The cost function is only about the norm of the parameters. It can be split into the true parameters and the fake ones:

$$\mathrm{cost}_{\lambda,p}=\|\beta\|^2+\inf\{\|\beta'\|^2 \mid X'\beta'=y-X\beta\}$$

In the right expression, the minimum norm solution is given by:

$$\beta'=X'^+(y-X\beta )$$

Now using SVD for $X'$:

$$X'=U\Sigma V$$

$$X'^{+}=V^\top\Sigma^{+} U^\top$$

We see that the norm of $\beta'$ essentially depends on the singular values of $X'^+$, which are the reciprocals of the singular values of $X'$. The normalized version of $X'$ is $\sqrt{p/\lambda}\,X'$. Singular values of large random matrices are well studied in the literature. For $p$ and $n$ large enough, the minimum singular value $s_\min$ and maximum singular value $s_\max$ are approximated by (see theorem 1.1):

$$s_\min(\sqrt{p/\lambda}X')\approx \sqrt p\left(1-\sqrt{n/p}\right)$$ $$s_\max(\sqrt{p/\lambda}X')\approx \sqrt p \left(1+\sqrt{n/p}\right)$$
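These approximations are easy to check numerically (a quick simulation of my own): with i.i.d. fake predictors of variance $\lambda/p$, the singular values of $\sqrt{p/\lambda}\,X'$ indeed concentrate around $\sqrt p$ when $p\gg n$:

import numpy as np

rng = np.random.RandomState(0)
n, p, lam = 80, 10000, 1.0

# fake predictors: i.i.d. entries with variance lam/p
Xp = rng.randn(n, p) * np.sqrt(lam / p)

s = np.linalg.svd(np.sqrt(p / lam) * Xp, compute_uv=False)
print(s.min() / np.sqrt(p), s.max() / np.sqrt(p))   # about 1 -/+ sqrt(n/p)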

Since, for large $p$, $\sqrt{n/p}$ tends towards 0, we can just say that all singular values are approximated by $\sqrt p$. Thus:

$$\|\beta'\|\approx\frac{1}{\sqrt\lambda}\|y-X\beta\|$$

Finally:

$$\mathrm{cost}_{\lambda,p}\approx\|\beta\|^2+\frac{1}{\lambda}\|y-X\beta\|^2=\mathrm{cost}_\lambda$$

Note: it does not matter whether you keep the coefficients of the fake predictors in your model. The variance introduced by $\beta'x'$ is $\frac{\lambda}{p}\|\beta'\|^2\approx\frac{1}{p}\|y-X\beta\|^2\approx\frac{n}{p}\mathrm{MSE}(\beta)$. Thus you increase your MSE only by a factor of $1+n/p$, which tends towards 1 anyway. So there is no real need to treat the fake predictors differently from the real ones.
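The theorem itself can also be checked by simulation (again a toy sketch of mine, with arbitrary dimensions): fit Ridge on the true predictors, then fit minimum norm OLS on the design augmented with fake predictors of variance $\lambda/p$ and keep only the true coefficients; the two estimates agree up to an error that shrinks as the number of fake predictors grows.

import numpy as np

rng = np.random.RandomState(1)
n, p_true, p_fake, lam = 50, 10, 10000, 2.0

X = rng.randn(n, p_true)
y = X @ rng.randn(p_true) + rng.randn(n)

# ridge on the true predictors
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p_true), X.T @ y)

# augment with fake predictors of variance lam/p_fake and fit minimum norm OLS
X_fake = rng.randn(n, p_fake) * np.sqrt(lam / p_fake)
beta_full = np.linalg.pinv(np.hstack([X, X_fake])) @ y

# the first p_true coefficients approximate the ridge solution
print(np.max(np.abs(beta_full[:p_true] - beta_ridge)))   # small; shrinks as p_fake grows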

Now, back to @amoeba's data. After applying theoretical PCA to $x$ (assumed to be normal), $x$ is transformed by a linear isometry into a variable $u$ whose components are independent and sorted in decreasing order of variance. The problem $y=\beta x+\epsilon$ is equivalent to the transformed problem $y=\gamma u+\epsilon$.

Now imagine that the variances of the components look like this:

[figure: component variances in decreasing order]

Consider the last components, say $p$ of them, and call the sum of their variances $\lambda$. Each of them has variance approximately equal to $\lambda/p$, and they are independent. They play the role of the fake predictors in the theorem.

This fact is clearer in @jonny's model: only the first component of the theoretical PCA is correlated with $y$ (it is proportional to $\overline{x}$) and it has huge variance. All the other components (proportional to $x_i-\overline{x}$) have comparatively very small variance (write the covariance matrix and diagonalize it to see this) and play the role of fake predictors. I calculated that the regularization here corresponds (approximately) to the prior $N(0,\frac{1}{p^2})$ on $\gamma_1$, while the true $\gamma_1^2=\frac{1}{p}$. This definitely over-shrinks, which is visible in the fact that the final MSE is much larger than the ideal MSE: the regularization effect is too strong.

It is sometimes possible to improve on this natural regularization with Ridge. First, you sometimes need the $p$ in the theorem to be really big (1000, 10000...) before it seriously rivals Ridge, so the finiteness of $p$ acts as an imprecision. But this also shows that Ridge is merely an additional regularization on top of a naturally existing implicit one, and can thus have only a very small effect. Sometimes this natural regularization is already too strong, and Ridge is not even an improvement; in that case it is better to use anti-regularization: Ridge with a negative coefficient. The plot below shows the MSE for @jonny's model ($p=1000$), using $\lambda\in\mathbb{R}$:

[figure: MSE as a function of $\lambda$, including negative values]
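For reference, one way to compute a Ridge path that extends to negative $\lambda$ (a sketch of mine, not necessarily the code behind the plot) is via the SVD of $X$; the solution stays well defined as long as $\lambda > -s_\min(X)^2$:

import numpy as np

def ridge_svd(X, y, lam):
    # ridge via SVD: beta = V diag(s / (s^2 + lam)) U^T y, with y a 1-d array
    # works for negative lam as long as lam > -min(s)^2
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

One can then cross-validate over a grid of $\lambda$ that includes negative values, exactly as for ordinary Ridge.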