Regression – How to Identify and Prevent Overfitting in Polynomial Regression

fitting, overfitting, regression

Given a set of data and training points, suppose we obtain a polynomial regression of the form

$$f(x) = w_0 + w_1 x + \ldots + w_n x^n$$

  • Is there a rule of thumb for which polynomial orders are more likely
    to overfit? For example, can we say that as the order of the polynomial grows, the likelihood of overfitting increases?

  • Also, is it common to observe explosive behavior for higher-order
    polynomials? For example, suppose your data is concentrated in the
    range $[-5, +5]$. Could the polynomial regression explode to a range
    of, say, $[-1000, 1000]$ due to the higher-order terms $x^n$, the
    weights, and possibly the outliers in the training data?

Best Answer

  • There is no such rule for specific polynomial orders that is agnostic to your dataset. If any such rule existed, I would expect it to be a function of your data or your data-generating process; without knowing something about that, it is hard to say. Your general statement is right, though: higher-order polynomials are more likely to overfit, because as the order of the polynomial increases, so does the variance of the estimator.
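A small sketch of this point, using made-up quadratic data (the specific function and noise level are assumptions for illustration): a high-order least-squares fit always achieves training error at most that of a lower-order fit on the same points, yet it can behave wildly just outside the data range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: noisy samples of a quadratic on [-5, 5].
x = np.linspace(-5, 5, 20)
y = 0.5 * x**2 + rng.normal(scale=2.0, size=x.shape)

# Fit a low-order and a high-order polynomial to the same points.
low = np.polynomial.Polynomial.fit(x, y, deg=2)
high = np.polynomial.Polynomial.fit(x, y, deg=15)

# Training MSE: the high-order fit hugs the noise, so it is never worse
# on the training points (nested least-squares bases).
err_low = np.mean((low(x) - y) ** 2)
err_high = np.mean((high(x) - y) ** 2)
print("train MSE, deg 2 :", err_low)
print("train MSE, deg 15:", err_high)

# Just outside the data range, the high-order fit can swing far from
# the true quadratic trend.
print("deg 2  at x=5.5:", low(5.5))
print("deg 15 at x=5.5:", high(5.5))
```

The lower training error of the degree-15 fit is exactly the overfitting being asked about: it reflects variance absorbed from the noise, not a better model.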

  • Yes, this is a common issue with higher order polynomials. It is similar in spirit to Runge's phenomenon. The common solutions are to find the best order via cross-validation (grid search), or by controlling the size of the coefficients with regularization.
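The two remedies mentioned above can be combined in one search. A minimal sketch with scikit-learn (assuming it is available; the data here is the same made-up quadratic as before), grid-searching the polynomial degree and the ridge penalty jointly under cross-validation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + rng.normal(scale=2.0, size=50)

# Pipeline: expand inputs to polynomial features, then fit a linear
# model with an L2 penalty on the coefficients.
model = make_pipeline(PolynomialFeatures(), Ridge())

# Grid-search the degree and the regularization strength together,
# scored by 5-fold cross-validation.
search = GridSearchCV(
    model,
    param_grid={
        "polynomialfeatures__degree": range(1, 11),
        "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
    },
    cv=5,
)
search.fit(x, y)
print(search.best_params_)
```

Cross-validation penalizes the degrees that merely fit the noise, while the ridge penalty keeps the higher-order coefficients small, which also tames the explosive behavior near the edges of the data range.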
