Solved – How to determine what degree of polynomial to fit to data

machine learning, polynomial, regression

Say you have to fit a polynomial to data that was itself generated by a polynomial of unknown degree. How do you determine what degree of polynomial to fit to that data?

Best Answer

I suggest doing this via cross-validation. In short, the data are split into K "folds". Each fold takes a turn as the test set, while the remaining K-1 folds are used to train a model. The trained model predicts the held-out fold and the prediction error is recorded. The cross-validated (CV) error is the average error over the K test sets. This process is repeated for each model you want to evaluate, and the model with the lowest CV error is selected.
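To make the procedure concrete, here is a minimal sketch of the K-fold loop written out by hand, using only NumPy. The helper name `kfold_cv_mse`, the simulated cubic, and the seeds are my own illustrative choices, not part of any library, and mean squared error is assumed as the error measure.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, k=10, seed=0):
    """Average held-out MSE of a polynomial fit of the given degree (illustrative helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(x.size)        # shuffle once, then cut into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]                  # fold i is the test set this round
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)   # train on the other k-1 folds
        pred = np.polyval(coefs, x[test])                # predict the held-out fold
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))        # cross-validated error

# Illustrative data: a noisy cubic (coefficients are arbitrary assumptions)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 1.5 * x**3 + rng.normal(0, 1, size=x.size)

# One CV error per candidate degree; pick the smallest
cv_errors = {d: kfold_cv_mse(x, y, d) for d in range(1, 7)}
best_degree = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_degree)
```

Degrees that are too low should show a clearly larger CV error here, while degrees at or above the true one tend to be close to each other, so the selected degree can overshoot slightly on noisy data. In practice you would let a library handle this bookkeeping.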

Each of your polynomial degrees is a separate model. Here is some code to run an example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

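def make_poly_features(x, degree):
    """Simulate a response y from a random polynomial of the given degree in x.

    Despite the name, this returns the noisy response and the true
    coefficients, not the feature matrix.
    """
    # Design matrix with columns 1, x, x^2, ..., x^degree
    X = np.zeros(shape=(x.size, degree + 1))
    X[:, 0] = 1
    for i in range(degree):
        X[:, i + 1] = np.power(x, i + 1)

    # True coefficients, drawn at random (standard deviation 2)
    betas = np.random.normal(0, 2, size=X.shape[1])

    # Response = polynomial signal + Gaussian noise (standard deviation 4)
    y = X @ betas + np.random.normal(0, 4, size=x.size)

    return y, betas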


# True degree is drawn from 2..5 (randint's upper bound is exclusive)
degree = np.random.randint(low=2, high=6)
x = np.random.normal(size=100)
y, coef = make_poly_features(x, degree)

plt.scatter(x,y)

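# Standardize x, expand to polynomial terms, then fit ordinary least squares
model = make_pipeline(StandardScaler(), PolynomialFeatures(), LinearRegression())

# Candidate degrees 2..5; each one is a separate model to cross-validate
parms = {'polynomialfeatures__degree': np.arange(2, 6)}

# 10-fold CV; sklearn maximizes the score, hence the negated MSE
gscv = GridSearchCV(model, parms, cv=10, scoring='neg_mean_squared_error')
gscv.fit(x.reshape(-1, 1), y)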

space = np.linspace(-3,3,101).reshape(-1,1)

est_deg = gscv.best_params_['polynomialfeatures__degree']

# Overlay the refit best model on the scatter plot
plt.plot(space, gscv.predict(space), color='red')
plt.title(f'True Degree: {degree}  Estimated Degree: {est_deg}')
plt.show()


In this example I randomly pick a polynomial degree and then generate data from a polynomial of that degree. I then use scikit-learn's built-in tools (a pipeline plus GridSearchCV) to estimate the degree by cross-validation. If you need background on any of these steps, I suggest you read An Introduction to Statistical Learning, particularly chapter 5 (on resampling methods). The sklearn documentation is also quite useful and includes some background theory.
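If you want to see how close the candidate degrees were, rather than only the winner, the fitted grid search keeps the cross-validated score for every candidate in its `cv_results_` attribute. Below is a small self-contained sketch of reading those scores. The simulated cubic, the seed, and the degree range 1..6 are assumptions made for illustration, not the data from the example above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Illustrative data: a noisy cubic (coefficients are arbitrary assumptions)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x - 1.0 * x**2 + 1.5 * x**3 + rng.normal(0, 1, size=x.size)

model = make_pipeline(StandardScaler(), PolynomialFeatures(), LinearRegression())
parms = {'polynomialfeatures__degree': np.arange(1, 7)}
gscv = GridSearchCV(model, parms, cv=10, scoring='neg_mean_squared_error')
gscv.fit(x.reshape(-1, 1), y)

# One entry per candidate degree, in the order the grid was searched
degrees = [int(d) for d in gscv.cv_results_['param_polynomialfeatures__degree']]
cv_mse = -gscv.cv_results_['mean_test_score']   # flip the sign back to a plain MSE

for d, mse in zip(degrees, cv_mse):
    print(f'degree {d}: CV MSE = {mse:.2f}')
```

When several degrees have nearly the same CV error, the data do not strongly distinguish between them, and preferring the lowest of those degrees is a reasonable choice.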