In the regularisation context, a "large" coefficient means that the estimate's magnitude is larger than it would have been if a fixed model specification had been used. It is the impact of obtaining not just the estimates, but also the model specification, from the data.
Consider what a procedure like stepwise regression will do for a given variable. If the estimate of its coefficient is small relative to its standard error, the variable will be dropped from the model. This could be because the true value really is small, or simply because of random error (or a combination of the two). If it's dropped, then we no longer pay it any attention. On the other hand, if the estimate is large relative to its standard error, it will be retained. Notice the imbalance: our final model rejects a variable when its coefficient estimate is small, but keeps it when the estimate is large. Thus we are likely to overestimate the magnitudes of the coefficients we retain.
Put another way, overfitting means overstating the impact of a given set of predictors on the response, and the only way to overstate that impact is for the estimated coefficients to be too big (and conversely, the estimates for the excluded predictors to be too small).
What you should do is incorporate into your experiment a variable selection procedure, eg stepwise regression via `step`. Then repeat your experiment multiple times on different random samples and save the estimates. You should find that all the estimates of the coefficients $\beta_3$ to $\beta_{10}$ are systematically too large compared to not using variable selection. Regularisation procedures aim to fix or mitigate this problem.
Here's an example of what I'm talking about.
repeat.exp <- function(M)
{
    # fixed design: 25 points on [-2, 2], expanded into 10 orthogonal polynomial terms
    x <- seq(-2, 2, len=25)
    px <- poly(x, 10)
    colnames(px) <- paste0("x", 1:10)
    out <- setNames(rep(NA, 11), c("(Intercept)", colnames(px)))
    sapply(1:M, function(...) {
        # the true model is quadratic, so beta_3 to beta_10 are all zero
        y <- x^2 + rnorm(length(x), sd=2)
        d <- data.frame(px, y)
        # stepwise selection starting from y ~ x1, with all 10 terms in scope
        b <- coef(step(lm(y ~ x1, data=d), y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, trace=0))
        # terms dropped by step stay NA for this replication
        out[names(b)] <- b
        out
    })
}
set.seed(53520)
z <- repeat.exp(M=1000)
# some time later...
rowMeans(abs(z), na.rm=TRUE)
(Intercept)       x1       x2       x3       x4       x5       x6       x7       x8       x9      x10
   1.453553 3.162100 6.533642 3.108974 3.204341 3.131208 3.118276 3.217231 3.293691 3.149520 3.073062
Contrast this with what happens when you don't use variable selection and just fit all the terms blindly. While there is still some error in the estimates of $\beta_3$ to $\beta_{10}$, the average deviation from zero is much smaller.
repeat.exp.base <- function(M)
{
    # same design as above, but with no variable selection
    x <- seq(-2, 2, len=25)
    px <- poly(x, 10)
    colnames(px) <- paste0("x", 1:10)
    out <- setNames(rep(NA, 11), c("(Intercept)", colnames(px)))
    sapply(1:M, function(...) {
        y <- x^2 + rnorm(length(x), sd=2)
        d <- data.frame(px, y)
        # always fit the full 10-term model
        b <- coef(lm(y ~ ., data=d))
        out[names(b)] <- b
        out
    })
}
set.seed(53520)
z2 <- repeat.exp.base(M=1000)
rowMeans(abs(z2))
(Intercept)       x1       x2       x3       x4       x5       x6       x7       x8       x9      x10
   1.453553 1.676066 6.400629 1.589061 1.648441 1.584861 1.611819 1.607720 1.656267 1.583362 1.556168
Also, both L1 and L2 regularisation implicitly assume that all your variables, and hence their coefficients, are in the same units of measurement, ie that a unit change in $\beta_1$ is comparable to a unit change in $\beta_2$. Hence the usual step of standardising your variables before applying either of these techniques.
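For example, here is a minimal sketch of that standardisation step with ridge regression via the glmnet package; the toy data and the choice of scales are made up purely for illustration:

library(glmnet)

set.seed(1)
n  <- 100
x1 <- rnorm(n)              # eg a length measured in metres
x2 <- rnorm(n, sd=1000)     # the same kind of quantity in millimetres
y  <- 2*x1 + 0.002*x2 + rnorm(n)
X  <- cbind(x1, x2)

# glmnet standardises the columns internally by default (standardize=TRUE),
# so the penalty treats both coefficients on a comparable scale
fit <- glmnet(X, y, alpha=0)   # alpha=0 gives the ridge penalty

# the equivalent manual route: scale first, then skip the internal standardisation
fit2 <- glmnet(scale(X), y, alpha=0, standardize=FALSE)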
Best Answer
This is a well-known problem with high-order polynomials, called Runge's phenomenon. Numerically it is associated with ill-conditioning of the Vandermonde matrix, which makes the coefficients very sensitive to small variations in the data and/or roundoff in the computations (i.e. the model is not stably identifiable). See also this answer on the SciComp SE.
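You can see the conditioning directly in R; this small sketch (on the same grid as the experiments above) compares the raw Vandermonde matrix with the orthogonal basis produced by poly:

x <- seq(-2, 2, len=25)
V <- outer(x, 0:10, "^")   # raw Vandermonde matrix: columns 1, x, x^2, ..., x^10
P <- poly(x, 10)           # orthogonal polynomial basis on the same grid

kappa(V)   # enormous: tiny perturbations in y produce large swings in the coefficients
kappa(P)   # about 1: the orthonormal columns make the fit numerically stable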
There are many solutions to this problem, for example Chebyshev approximation, smoothing splines, and Tikhonov regularization. Tikhonov regularization is a generalization of ridge regression, penalizing a norm $\|\Lambda \theta\|$ of the coefficient vector $\theta$, where for smoothing the weight matrix $\Lambda$ is some derivative operator. To penalize oscillations, you might use $\Lambda \theta = p^{\prime\prime}[x]$, where $p[x]$ is the polynomial evaluated at the data.
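For concreteness, here is a minimal sketch of that second-derivative penalty in R; the raw polynomial basis and the value of lambda are illustrative assumptions, not recommendations:

x <- seq(-2, 2, len=25)
y <- x^2 + rnorm(length(x), sd=2)

# design matrix for a raw degree-10 polynomial: column j+1 holds x^j
X <- outer(x, 0:10, "^")

# L evaluates the second derivative p''(x) at the data points,
# using d^2/dx^2 x^j = j*(j-1)*x^(j-2)
L <- outer(x, 0:10, function(x, j) j*(j-1)*x^pmax(j-2, 0))

# Tikhonov solution: minimise ||y - X theta||^2 + lambda*||L theta||^2
lambda <- 1   # illustrative value; in practice chosen by eg cross-validation
theta <- solve(crossprod(X) + lambda * crossprod(L), crossprod(X, y))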
EDIT: The answer by user hxd1011 notes that some of the numerical ill-conditioning problems can be addressed using orthogonal polynomials, which is a good point. I would note, however, that the identifiability issues with high-order polynomials still remain. That is, numerical ill-conditioning is associated with sensitivity to "infinitesimal" perturbations (e.g. roundoff), while "statistical" ill-conditioning concerns sensitivity to "finite" perturbations (e.g. outliers; the inverse problem is ill-posed).
The methods mentioned in my second paragraph are concerned with this outlier sensitivity. You can think of this sensitivity as a violation of the standard linear regression model, which by using an $L_2$ misfit implicitly assumes the errors are Gaussian. Splines and Tikhonov regularization deal with this outlier sensitivity by imposing a smoothness prior on the fit. Chebyshev approximation deals with it by using an $L_{\infty}$ misfit applied over the continuous domain, i.e. not just at the data points. Though Chebyshev polynomials are orthogonal (w.r.t. a certain weighted inner product), I believe that if used with an $L_2$ misfit over the data they would still have outlier sensitivity.
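To make the Chebyshev idea concrete, here is a small sketch comparing degree-10 interpolation of Runge's classic example at equispaced versus Chebyshev nodes. Interpolation at Chebyshev nodes is only near-minimax rather than a true $L_{\infty}$ fit, so treat this as an illustration of the principle:

f <- function(x) 1/(1 + 25*x^2)     # Runge's classic example on [-1, 1]

fit.interp <- function(nodes)       # degree-10 interpolant through 11 nodes
    lm(f(nodes) ~ poly(nodes, 10))

xx <- seq(-1, 1, len=1001)          # fine grid to approximate the sup-norm error
max.err <- function(fit) max(abs(f(xx) - predict(fit, data.frame(nodes=xx))))

equi <- seq(-1, 1, len=11)              # equispaced nodes
cheb <- cos((2*(1:11) - 1)/22 * pi)     # Chebyshev nodes

max.err(fit.interp(equi))   # large: the interpolant oscillates near the ends
max.err(fit.interp(cheb))   # much smaller: close to the minimax error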