Solved – Why are there large coefficients for higher-order polynomials

Tags: curve fitting, least squares, polynomial, regression

Bishop's book on machine learning discusses the problem of fitting a polynomial function to a set of data points.

Let M be the order of the fitted polynomial. The book states that:

We see that, as M increases, the magnitude of the coefficients typically gets larger. In particular for the M = 9 polynomial, the coefficients have become finely tuned to the data by developing large positive and negative values so that the corresponding polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits large oscillations.

I don't understand why large coefficient values imply a closer fit to the data points. I would have thought the values would instead become more precise after the decimal point to give a better fit.
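
For concreteness, here is a minimal numpy sketch (my own, not from the book) that reproduces the observation; the data setup, sin(2πx) plus Gaussian noise at 10 points, is an assumption mirroring Bishop's running example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)                                        # 10 data points, as in Bishop's example
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)    # noisy targets (assumed setup)

for M in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=M)   # ordinary least-squares polynomial fit of order M
    print(f"M = {M}: max |coefficient| = {np.abs(coeffs).max():.3g}")
```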

Best Answer

This is a well-known issue with high-order polynomials, known as Runge's phenomenon. Numerically it is associated with ill-conditioning of the Vandermonde matrix, which makes the coefficients very sensitive to small variations in the data and/or roundoff in the computations (i.e. the model is not stably identifiable). See also this answer on the SciComp SE.
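
As a quick illustration of that ill-conditioning (a sketch of my own, not from the linked answer), you can watch the condition number of the Vandermonde matrix blow up as the order grows; the 10 equispaced points on [0, 1] are an assumption:

```python
import numpy as np

x = np.linspace(0, 1, 10)
for M in (3, 6, 9):
    V = np.vander(x, N=M + 1)                        # Vandermonde design matrix for a degree-M fit
    print(f"M = {M}: cond(V) = {np.linalg.cond(V):.3e}")
```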

There are many solutions to this problem, for example Chebyshev approximation, smoothing splines, and Tikhonov regularization. Tikhonov regularization is a generalization of ridge regression, penalizing a norm $\|\Lambda \theta\|$ of the coefficient vector $\theta$, where for smoothing the weight matrix $\Lambda$ is some derivative operator. To penalize oscillations, you might use $\Lambda \theta=p^{\prime\prime}[x]$, where $p[x]$ is the polynomial evaluated at the data.
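
For example, here is a rough sketch of that derivative-penalty idea; the penalty weight `alpha` and the data setup are my assumptions, not a reference implementation:

```python
import numpy as np

def tikhonov_polyfit(x, y, M, alpha):
    # Design matrix A[i, j] = x_i**j (increasing powers), so the fit is p(x_i) = (A @ theta)[i].
    A = np.vander(x, N=M + 1, increasing=True)
    # Lam[i, j] = j*(j-1)*x_i**(j-2), so that Lam @ theta = p''(x_i) at the data points.
    Lam = np.zeros_like(A)
    for j in range(2, M + 1):
        Lam[:, j] = j * (j - 1) * x ** (j - 2)
    # Regularized normal equations: (A^T A + alpha * Lam^T Lam) theta = A^T y.
    return np.linalg.solve(A.T @ A + alpha * Lam.T @ Lam, A.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
theta = tikhonov_polyfit(x, y, M=9, alpha=1e-4)
print(np.abs(theta).max())   # typically far smaller than the unregularized coefficients
```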

EDIT: The answer by user hxd1011 notes that some of the numerical ill-conditioning problems can be addressed using orthogonal polynomials, which is a good point. I would note however that the identifiability issues with high-order polynomials still remain. That is, numerical ill-conditioning is associated with sensitivity to "infinitesimal" perturbations (e.g. roundoff), while "statistical" ill-conditioning concerns sensitivity to "finite" perturbations (e.g. outliers; the inverse problem is ill-posed).
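
As a brief illustration of hxd1011's point (the equispaced grid and the mapping of the data to [-1, 1] are assumptions on my part), comparing the condition numbers of the two bases shows the numerical benefit of an orthogonal basis:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

x = np.linspace(0, 1, 10)
t = 2 * x - 1                                    # map the data to Chebyshev's natural domain [-1, 1]
M = 9
print(np.linalg.cond(np.vander(x, N=M + 1)))     # monomial (Vandermonde) basis: very large
print(np.linalg.cond(C.chebvander(t, deg=M)))    # Chebyshev basis: much smaller in practice
```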

The methods mentioned in my second paragraph are concerned with this outlier sensitivity. You can think of this sensitivity as violation of the standard linear regression model, which by using an $L_2$ misfit implicitly assumes the data is Gaussian. Splines and Tikhonov regularization deal with this outlier sensitivity by imposing a smoothness prior on the fit. Chebyshev approximation deals with this by using an $L_{\infty}$ misfit applied over the continuous domain, i.e. not just at the data points. Though Chebyshev polynomials are orthogonal (w.r.t. a certain weighted inner product), I believe that if used with an $L_2$ misfit over the data they would still have outlier sensitivity.
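
As a hedged sketch of the smoothing-spline alternative, using SciPy's UnivariateSpline (the smoothing factor `s` is an arbitrary choice here, trading data fidelity against the smoothness prior discussed above):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

spline = UnivariateSpline(x, y, s=0.1)   # s > 0 smooths rather than interpolates the data
xx = np.linspace(0, 1, 200)
yy = spline(xx)                          # smooth evaluation across the whole range
print(yy.min(), yy.max())
```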