Solved – How to explain the “bad reputation” of higher-order polynomials

Tags: overfitting, polynomial, regression

We all must have heard it by now – when we start learning about statistical models overfitting data, the first example we are often given is about "polynomial functions" (e.g., see the picture here):

[Figure: typical textbook illustration of a high-degree polynomial overfitting the training data]

We are warned that although higher-degree polynomials can fit training data quite well, they surely will overfit and generalize poorly to the test data.

Why does this happen? Is there a mathematical justification as to why (higher-degree) polynomial functions overfit the data? The closest explanation I could find online was something called "Runge's phenomenon", which suggests that higher-order polynomials tend to "oscillate" a lot – does this explain why polynomial functions are known to overfit data?

I understand that there is a whole field of "regularization" that tries to fix these overfitting problems (e.g., penalization can prevent a statistical model from "hugging" the data too closely) – but just using mathematical intuition, why are polynomials known to overfit the data?

In general, "functions" (e.g., the response variable you are trying to predict with a machine learning algorithm) can be approximated using older methods such as Fourier series and Taylor series, as well as newer methods such as neural networks. I believe there are theorems guaranteeing that Taylor series, polynomials and neural networks can approximate any (reasonable) function arbitrarily well. Perhaps neural networks simply promise smaller errors at lower complexity?

But are there mathematical reasons behind higher-order polynomials (e.g., polynomial regression) being said to have a bad habit of overfitting, to the extent that they have become very unpopular? Is this solely explainable by Runge's phenomenon?

Reference:

Gelman, A. and Imbens, G. (2019). Why high-order polynomials should not be used in regression discontinuity designs. Journal of Business & Economic Statistics, 37(3), 447-456. (An NBER working paper version is also available.)

Best Answer

High degree polynomials do not overfit the data

This is a common misconception, which is nonetheless found in many textbooks. In general, specifying a statistical model requires specifying both a hypothesis class and a fitting procedure. To define the variance of a model ("variance" here in the sense of the bias-variance tradeoff), you need to know how the model is fit, so claiming that a hypothesis class (such as the polynomials) can overfit the data is simply a category error.

To illustrate this point, consider that the Lagrange interpolation method is only one of several ways to fit a polynomial to a collection of points. Another method is to use Bernstein polynomials. Given $n$ observation points $x_i=i/n$ and $y_i=f(x_i)+\epsilon_i$, the Bernstein approximant is, like the Lagrange approximant, a polynomial of degree $n-1$ (i.e., as many degrees of freedom as there are data points). However, we can see that the fitted Bernstein polynomials do not exhibit the wild oscillations that the Lagrange polynomials do. Even more strikingly, the variance of the Bernstein fits actually decreases as we add more points (and thus increase the degree of the polynomials).

As you can see, the Bernstein polynomial does not pass through each point. This reflects the fact that it uses a different fitting procedure than the Lagrange polynomial. But both models have the exact same hypothesis class (the set of degree $n-1$ polynomials), which underscores that it is incoherent to talk about statistical properties of a hypothesis class without also specifying the loss function and fitting procedure.

[Figures: fitted Bernstein polynomials from the simulations described above]
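To make the comparison concrete, here is a minimal sketch (my own illustration, separate from the simulation code at the end of this answer) that fits a Lagrange interpolant and a Bernstein approximant of the same degree to the same noisy sample, using scipy.interpolate.lagrange for the former and evaluating the Bernstein basis directly for the latter:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import lagrange
from scipy.special import binom

# Same hypothesis class (polynomials of degree n), two different fitting procedures.
rng = np.random.default_rng(0)
n = 15
X = np.linspace(0, 1, n + 1)
Y = np.sin(8 * X) + 0.1 * rng.standard_normal(n + 1)
xs = np.linspace(0, 1, 400)

# Lagrange interpolant: passes through every data point, oscillates near the ends.
lag = lagrange(X, Y)(xs)

# Bernstein approximant: same degree, but a much smoother, lower-variance fit.
nu = np.arange(n + 1)
basis = binom(n, nu)[:, None] * xs[None, :]**nu[:, None] * (1 - xs[None, :])**(n - nu[:, None])
bern = (basis * Y[:, None]).sum(0)

plt.plot(xs, lag, label='Lagrange interpolant')
plt.plot(xs, bern, label='Bernstein approximant')
plt.scatter(X, Y, c='k', s=15, label='data')
plt.ylim(-3, 3)
plt.legend()
plt.show()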

As a second example, it is also possible to exactly interpolate the training data using a polynomial without wild oscillations, provided we work over the complex numbers. If the observation points $x_i$ lie on the unit circle, $x_i=e^{2\pi i\sqrt{-1}/N}$, then fitting a polynomial $y_i=\sum_d c_dx_i^d$ basically comes down to computing the inverse Fourier transform of the $y_i$. Below I plot the complex polynomial interpolant as a function of the complex argument (the angle along the unit circle). Moreover, it turns out that the variance of this interpolant is constant as a function of $N$.

[Figure: the complex polynomial interpolant plotted against the true function]
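As a rough check of the constant-variance claim (my own sketch, separate from the code at the end of this answer), one can refit the interpolant to many noisy samples and print the average pointwise variance of its real part for several values of $N$:

import numpy as np

# Monte Carlo estimate of the average pointwise variance of the (real part of the)
# complex polynomial interpolant for several N; the numbers stay roughly constant.
rng = np.random.default_rng(0)
thetas = np.linspace(0, 2 * np.pi, 200, endpoint=False)
for N in [8, 16, 32, 64]:
    xarg = np.linspace(0, 2 * np.pi, N, endpoint=False)
    V = np.exp(1j * xarg)[:, None] ** np.arange(N)[None, :]        # V[i, d] = x_i^d
    Z = np.exp(1j * thetas)[:, None] ** np.arange(N)[None, :]      # powers on a fine grid
    Y = np.sin(2.5 * xarg) + 0.25 * rng.standard_normal((2000, N)) # 2000 noisy samples
    W = np.linalg.solve(V, Y.T)                                    # coefficients, one column per sample
    fits = (Z @ W).real                                            # interpolants on the grid
    print(N, fits.var(axis=1).mean())                              # roughly constant in N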

Code for these simulations can be found at the end of this answer.

In summary, it is simply false that including higher-degree polynomials will, by itself, increase the variance of the model or make it more prone to overfitting.

As a consequence, any explanation of why Lagrange interpolants do in fact overfit will need to use the actual structure of the model and fitting procedure, and cannot be based on general parameter-counting arguments.

Fortunately, it is not too hard to show mathematically why Lagrange interpolants do exhibit this overfitting. What it basically comes down to is that fitting the Lagrange interpolant requires inverting a Vandermonde matrix, and Vandermonde matrices are often badly ill-conditioned.
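As a quick illustration (not part of the original argument), the condition number of the Vandermonde matrix of equispaced points grows explosively with $n$:

import numpy as np

# Condition number of the Vandermonde matrix for equispaced points on [0, 1]:
# it blows up with n, so the interpolation coefficients (and hence the fitted
# values away from the nodes) become extremely sensitive to noise in y.
for n in [5, 10, 15, 20]:
    x = np.linspace(0, 1, n)
    V = np.vander(x, increasing=True)   # V[i, d] = x_i^d
    print(n, np.linalg.cond(V))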

More formally, what we want to do is to show that the variance of the Lagrange interpolants grows very quickly as a function of $n$, the number of data points.

Assume we are given some fixed points $x_i$, as well as $y$ values sampled as $y_i=f(x_i)+\epsilon_i$, where $f$ is some smooth function and the $\epsilon_i$ are iid samples from a unit normal distribution. The Lagrange interpolant then takes the form $\hat{f}(x)=\sum_i l_i(x)y_i$, where the $l_i$ are the Lagrange basis polynomials of the $x_i$. We want to calculate $Var_y(\hat{f}(x))$, where the variance is with respect to different sampled values of the $y$s. By independence of the $\epsilon_i$, we have

$$Var_y(\hat{f}(x))=Var_y(\sum_i l_i(x)(f(x_i)+\epsilon_i))=\sum_i l_i(x)^2$$
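This formula is easy to verify numerically; here is a minimal sketch (my own check) that compares a Monte Carlo estimate of the variance against $\sum_i l_i(x)^2$:

import numpy as np

# Check that the Monte Carlo variance of the Lagrange interpolant matches the
# formula sum_i l_i(x)^2 when the noise has unit variance.
rng = np.random.default_rng(0)
n = 10
nodes = np.arange(1, n + 1, dtype=float)
xs = np.linspace(1, n, 200)

# Lagrange basis polynomials l_i evaluated on the grid xs.
L = np.ones((n, xs.size))
for i in range(n):
    for j in range(n):
        if j != i:
            L[i] *= (xs - nodes[j]) / (nodes[i] - nodes[j])

# Refit the interpolant to many noisy samples y_i = f(x_i) + eps_i.
f = np.sin(nodes)
fits = np.array([L.T @ (f + rng.standard_normal(n)) for _ in range(5000)])

print(np.allclose(fits.var(0), (L**2).sum(0), rtol=0.1))   # True, up to Monte Carlo error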

Now, consider for simplicity the case where the observation points are $x_1,\dots, x_n=1,\dots, n$. In general, the variance $Var(\hat{f}(x))$ is a function of the location $x$, but we will consider the integrated variance $Var=\int_0^nVar(\hat{f}(x))dx$.

Claim: $Var$ increases (at least) like $O(e^{an}n^{-9/2})$ for some $a>0$.

In general, we have the formula $l_i(x)={\frac {l(x)}{l'(x_i)(x-x_i)}}$ where $l(x):=\prod_i (x-x_i)$. We can compute the denominator: $$l'(x_j)=\prod_{i\neq j}(j-i)=(j-1)(j-2)...(1)(-1)...(-(n-j))=(-1)^{n-j}(j-1)! (n-j)!$$

So the variance at $x$ is $l(x)^2 \sum_j {\frac 1 {(j-1)!^2(n-j)!^2(x-x_j)^2}}$. Now, $|x-x_j|\leq n$ for $x\in[0,n]$, so ${\frac 1 {(x-x_j)^2}}\geq 1/n^2$ and

\begin{eqnarray*} \sum_j {\frac 1 {(j-1)!^2(n-j)!^2}}{\frac 1 {(x-x_j)^2}}&\geq &n^{-2}n!^{-2}\sum_j {n\choose j}^2\\ & = &n^{-2} n!^{-2}{2n\choose n}\\ &\sim& n^{-7/2} 4^n(e/n)^{2n} \end{eqnarray*}

where we used Stirling's formula in the last line.

So in order to get a lower bound $\int_0^n Var(\hat{f}(x))dx$, we need to integrate $l(x)^2$. Noting that $l(x)^2=n^{2n}\prod_i(x'-i/n)^2$, where $x'=x/n$, we have

$$l(x)^2=n^{2n}\prod_i (x'-i/n)^2=n^{2n}e^{2\sum_i \log |x'-i/n|}\sim n^{2n}e^{2n\int_0^1 \log |x'-y|dy}=n^{2n}e^{2n(-H(x')-1)}$$ where $H$ is the binary entropy function (with natural logarithms). So,

$$\int_0^n l(x)^2\,dx\gtrsim n^{2n-1}\int_0^1 e^{2n(-H(x')-1)}\, dx'\geq n^{2n-1}e^{-2C(\epsilon)n}\int_{1-\epsilon}^1 dx'=\epsilon\, n^{2n-1}e^{-2Cn}$$

where $C(\epsilon)=1+H(1-\epsilon)$ and $0<\epsilon<1/2$ is arbitrary.

The integrated variance $\int_0^n Var(\hat{f}(x))dx$ is thus at least $O(e^{-2Cn}n^{2n-9/2}4^n(e/n)^{2n})=O(n^{-9/2}({\frac {2e}{e^C}})^{2n})$

Now crucially, $C(\epsilon)<1+\log(2)$, because the binary entropy function is maximized at $1/2$. Therefore, $2ee^{-C}>1$, so we get the claimed form for the variance.
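The rapid growth is easy to see numerically; here is a small sketch (my own check) that evaluates the integrated variance $\int_0^n \sum_i l_i(x)^2 dx$ on a fine grid for a few values of $n$:

import numpy as np

# Numerical value of the integrated variance of the Lagrange interpolant on the
# equispaced nodes 1, ..., n; it explodes as n grows.
for n in [5, 10, 15, 20]:
    nodes = np.arange(1, n + 1, dtype=float)
    xs = np.linspace(0, n, 4000)
    L = np.ones((n, xs.size))
    for i in range(n):
        for j in range(n):
            if j != i:
                L[i] *= (xs - nodes[j]) / (nodes[i] - nodes[j])
    dx = xs[1] - xs[0]
    print(n, (L**2).sum(0).sum() * dx)   # crude Riemann sum of the integrated variance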

Edit: analysis of Bernstein variance

As an interesting comparison, it is not too hard to work out the variance of the Bernstein approximator. This polynomial is defined by $\hat{B}(x)=\sum_i b_{i,n}(x)y_i$, where $b_{i,n}(x)={n\choose i}x^i(1-x)^{n-i}$ are the Bernstein basis polynomials. Arguing as before, the variance with respect to different samplings of $y$ is $\sum_i b_{i,n}(x)^2$. As in the simulation, I'll average over $x\in [0,1]$. Evidently $\int_0^1 b_{i,n}(x)^2dx={n\choose i}^2B(2i+1,2n-2i+1)$, with $B$ the beta function. We have to sum this over $i$. I claim the sum comes to: $$\int_0^1 Var(\hat{B}(x))dx={\frac {4^n}{(2n+1){2n\choose n}}}$$

(proof is below)

This formula does indeed closely match the empirical results from the simulation (note that for a direct comparison, you have to multiply by $0.1^2$, the noise variance used in the simulation). By Stirling's formula, the variance of the Bernstein approximator goes as $O(n^{-1/2})$.
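For instance, evaluating the closed-form expression (scaled by the noise variance $0.1^2$) at the same values of $n$ as in the simulation code below gives numbers directly comparable to the plotted ones:

from scipy.special import binom

# Closed-form integrated variance of the Bernstein approximator, scaled by the
# noise variance 0.1**2 used in the simulation below.
for n in [5, 10, 20, 50, 100]:
    print(n, 4.0**n / ((2 * n + 1) * binom(2 * n, n)) * 0.1**2)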

Proof of variance formula

The first step is to use the identity $B(2i+1,2n-2i+1)=(2i)!(2n-2i)!/(2n+1)!$. So we get

\begin{eqnarray*} {n\choose i}^2 B(2i+1,2n-2i+1) & = & {\frac {(n!)^2}{(i!)^2(n-i)!^2}}{\frac {(2i)!(2n-2i)!}{(2n+1)!}}\\ & = & {\frac 1 {2n+1}}{2n\choose n}^{-1}{2i\choose i}{2n-2i\choose n-i} \end{eqnarray*}

Using the generating function $\sum_{i=0}^{\infty} {2i\choose i}x^i=(1-4x)^{-1/2}$, we see that $\sum_{i=0}^n {2i\choose i}{2n-2i\choose n-i}$ is the coefficient of $x^n$ in $((1-4x)^{-1/2})^2=(1-4x)^{-1}$, that is $4^n$. Summing over $i$ we get the claimed formula for the variance.

Edit: effect of the distribution of the x points on the Lagrange variance

As shown in the very interesting answer of Robert Mastragostino, the bad behavior of the Lagrange polynomial can be avoided through judicious choice of the $x_i$. This raises the possibility that the Lagrange polynomial does uncharacteristically badly for uniformly sampled points. In principle, given iid sample points $x_i\sim \mu$, it is possible that the Lagrange polynomials behave sensibly for most choices of $\mu$, with the uniform distribution being just an "unlucky choice". However, this is not the case: it turns out that the Lagrange variance still grows exponentially for any choice of $\mu$, with the single exception of the arcsine distribution $d\mu(x)=\textbf{1}_{(-1,1)}\pi^{-1}(1-x^2)^{-1/2}dx$. Note that in Robert Mastragostino's answer, the $x_i$ points are chosen to be Chebyshev nodes. As $n\to\infty$, the density of these points converges to none other than the arcsine distribution. So in a sense, the situation in his answer is the only exception to the rule of exponential growth of the variance.
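The effect of node placement is easy to see numerically. Here is a small sketch (my own illustration, using deterministic nodes rather than iid samples) that compares the average of $\sum_i l_i(x)^2$ over a fine grid for equispaced versus Chebyshev nodes:

import numpy as np

# Average of sum_i l_i(x)^2 over (-1, 1) for equispaced nodes versus Chebyshev
# nodes; the former blows up with n, the latter stays small.
def mean_variance(nodes, xs):
    n = len(nodes)
    L = np.ones((n, xs.size))
    for i in range(n):
        for j in range(n):
            if j != i:
                L[i] *= (xs - nodes[j]) / (nodes[i] - nodes[j])
    return (L**2).sum(0).mean()

xs = np.linspace(-1, 1, 2001)
for n in [10, 20, 40]:
    equi = np.linspace(-1, 1, n)
    cheb = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))   # Chebyshev nodes
    print(n, mean_variance(equi, xs), mean_variance(cheb, xs))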

More precisely, take any continuous distribution $\mu(dx)=p(x)dx$ supported on $(-1,1)$ and consider an infinite sequence of points $\{(x_i,f(x_i)+\epsilon_i)\}_{i=1}^{\infty}$, where the $x_i\sim \mu$ are iid, $f$ is a continuous function and the $\epsilon_i$ are iid standard normal samples. Let $\hat{f_n}(x)$ denote the Lagrange interpolant fitted on the first $n$ points, and let $V_n=\int Var(\hat{f_n})(x)p(x)dx$ denote the corresponding average variance.

Claim: Assume that (1): $p(x)>0$ for all $x\in (-1,1)$ and (2): $\int p(x)^2\sqrt{1-x^2}dx<\infty$. Then $V_n$ grows at least like $O(e^{\epsilon n})$ for some $\epsilon>0$, unless $p$ is the arcsine distribution.

To prove this, note that, as before, $Var(\hat{f_n}(x))=\sum_{i=1}^n l_{i,n}(x)^2$, where now $l_{i,n}$ denotes the $i$th Lagrange basis polynomial with respect to the first $n$ points. Now,

\begin{eqnarray*} \log \int l_{i,n}^2(x)d\mu(x)&\geq& 2\int \log |l_{i,n}(x)|p(x)dx\\ & = & 2\int \sum_{j\neq i, j\leq n} \left(\log |x-x_j|-\log |x_i-x_j|\right)p(x)dx \end{eqnarray*} where we used Jensen's inequality in the first line, and the definition of $l_{i,n}$ in the second.

Let us introduce the following notation: \begin{eqnarray*} f_p(x)&=&\int \log|x-y|p(y)dy\\ c_p& = & \int \log |x-y|p(x)p(y)dxdy \end{eqnarray*}

Then we get

\begin{eqnarray*}\liminf_n n^{-1}\log \int l_{i,n}^2(x)p(x)dx&\geq& 2\liminf_n n^{-1}\int \sum_{j\neq i, j\leq n} \left(\log |x-x_j|-\log |x_i-x_j|\right)p(x)dx\\ & \geq & 2 \int \liminf_n n^{-1}\sum_{j\neq i, j\leq n} \left(\log |x-x_j|-\log |x_i-x_j|\right)p(x)dx\\ & = & 2\int \left(f_p(x)-f_p(x_i)\right)p(x)dx\\ & = & 2(c_p-f_p(x_i)) \end{eqnarray*} using Fatou's lemma, and then the strong law of large numbers.

Note that $f_p'(x)=H_p(x)$, where $H_p(x)=\int {\frac {p(y)}{x-y}}dy$ is the Hilbert transform of $p$. In particular, $f_p$ is continuous. Now, suppose that $f_p(x)$ is non-constant. In this case, there exist $\epsilon>0$ and an interval $[a,b]$ such that $\inf_{x'\in [a,b]}\left(c_p-f_p(x')\right)\geq\epsilon/2$. By assumption (1), $\int_a^b p(x)dx>0$, so accordingly, $P(\exists i: x_i\in [a,b])=1$. We can reorder finitely many samples without changing the distribution, so WLOG we may assume that $x_1\in [a,b]$. As a consequence, $2(c_p-f_p(x_1))\geq\epsilon$.

Combining with the above, we get
$$n^{-1}\log \int l_{1,n}^2(x)p(x)dx\geq \epsilon/2$$ for all $n$ sufficiently large. Exponentiating, this says that $\int l_{1,n}^2(x)p(x)dx\geq e^{n\epsilon/2}$, and since $V_n=\int \sum_{i=1}^n l_{i,n}^2(x)p(x)dx \geq \int l_{1,n}^2(x)p(x)dx$, we get the claimed exponential growth, assuming $f_p$ is non-constant.

So the last step is to show that $f_p$ is non-constant if $p$ is not the arcsine distribution, or equivalently, that the Hilbert transform $H_p$ is not identically zero. Under assumption (2), this follows from a theorem of Tricomi. Thus we have proved the Claim.

Code for Bernstein polynomials

from scipy.special import binom
import numpy as np
import matplotlib.pyplot as plt

# Monte Carlo estimate of the average variance of the Bernstein approximator
# for several n: refit many noisy samples and average the pointwise variance.
variances=[]
n_range=[5,10,20,50,100]
xs=np.linspace(0,1,100)            # grid on which the fits are evaluated
for n in n_range:
    preds=[]
    for _ in range(1000):
        # noisy sample of f(x) = sin(8x) at n+1 equispaced points
        X=np.linspace(0,1,n+1)
        Y=np.sin(8*X)+np.random.randn(n+1)*.1
        # Bernstein basis binom(n,nu) x^nu (1-x)^(n-nu), weighted by the observations Y
        nu=np.arange(n+1)
        bern=binom(n,nu)[:,None]*(xs[None,:])**(nu[:,None])*(1-xs[None,:])**(n-nu[:,None])*Y[:,None]
        pred=bern.sum(0)           # Bernstein approximant evaluated on the grid
        preds.append(pred)
    preds=np.array(preds)
    variances.append(preds.var(0).mean())   # pointwise variance, averaged over x
variances=np.array(variances)
plt.scatter(n_range,variances)
plt.xlabel('n')
plt.ylabel('variance')
plt.ylim(0,.005)
plt.title('bernstein polynomial variance')
plt.show()

Code for complex polynomials

import numpy as np
import matplotlib.pyplot as plt

n=64
xarg=np.linspace(0,2*np.pi,n+1,endpoint=True)[:n]   # angles of the n observation points
y=np.sin(2.5*xarg)+.25*np.random.randn(n)           # noisy observations
xcomp=np.exp(1j*xarg)                               # observation points on the unit circle

xs=np.linspace(0,2*np.pi,200)                       # fine grid for plotting
ys=np.sin(2.5*xs)

# Vandermonde-type matrix X[d,j]=x_j^d; for roots of unity it is symmetric,
# so solving X w = y gives the polynomial coefficients w_d.
X=np.array([xcomp**i for i in range(len(xarg))])
w=np.linalg.solve(X,y)

# only need to go up to the middle frequency, because later coefficients are complex conjugates
interp=2*np.dot(w[1:n//2+1],np.array([np.exp(1j*xs*i) for i in range(1,n//2+1)]))+w[0]

plt.plot(xs,interp.real,label='complex polynomial interp')
plt.plot(xs,ys,c='gray',linestyle='--',label='true function')
plt.scatter(xarg,y)
plt.xlabel('complex argument')
plt.legend()