Solved – How to choose the order for polynomial regression

polynomial regression

How can we know which polynomial degree gives the best fit for a data set composed of one predictor and one response variable? And how can we evaluate the candidate fits?

I have fitted a linear regression and then went up to a third-degree polynomial, but I need to know how to assess the goodness of fit.

Best Answer

This question generalizes to selecting the hyper-parameters of any machine learning algorithm, for example the number of clusters in K-means or the number of hidden units in a neural network.

At a very high level, there are two approaches: data driven and knowledge driven. They are not mutually exclusive; in fact, combining the two would be ideal.

  • Data driven means using data to figure out which option is best. We usually have a training set and a testing set: we fit each candidate (here, each polynomial degree) on the training set and compare the candidates on the testing set. There are variations, such as adding a separate validation set or running repeated cross-validation. But the overall idea is to pick the candidate that performs best on the testing set, while making sure the testing set is very close to production data.

  • Knowledge driven means using "domain knowledge" to make the decision on parameter tuning. For example, suppose we are fitting trajectory data and we know from physics that the data should generally follow a parabolic trend, not a 5th-order polynomial curve. Then we would pick a 2nd-order polynomial for the fit. In addition, if we know our data is periodic, we may choose a Fourier expansion instead of polynomials. See the post "What's wrong to fit periodic data with polynomials?"
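The data-driven route can be sketched for the polynomial case as follows. This is a minimal illustration, not the answer's own code: the synthetic quadratic data, the helper names `cv_mse_for_degree` and `select_degree`, the candidate degrees 1 to 5, and the choice of 5-fold cross-validation are all assumptions made for the example.

```python
import numpy as np

def cv_mse_for_degree(x, y, degree, n_folds=5, seed=0):
    """Validation MSE of a polynomial of the given degree, averaged over folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    errors = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        # Fit on the training folds only, then score on the held-out fold.
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(errors))

def select_degree(x, y, candidate_degrees=(1, 2, 3, 4, 5), n_folds=5, seed=0):
    """Return the degree with the lowest cross-validated MSE and all scores."""
    scores = {d: cv_mse_for_degree(x, y, d, n_folds, seed) for d in candidate_degrees}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical data: a noisy quadratic, standing in for a real data set.
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=x.size)

best_degree, scores = select_degree(x, y)
for degree, mse in scores.items():
    print(f"degree {degree}: CV MSE = {mse:.3f}")
print("lowest CV MSE at degree", best_degree)
```

On data like this, the linear fit should score clearly worse than the others, while degrees 2 and above tend to score almost the same. When several degrees are within noise of each other, preferring the lowest one is a common tie-breaker.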

In sum, if we have a lot of data and can make sure the testing set fairly represents production data, then the data-driven approach works well. On the other hand, if we have a lot of domain knowledge about the relationship between input and output, then the knowledge-driven approach works well. The ideal case combines the two: use what we know about the relationship in the data, and test it carefully using a good testing set.
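As a rough sketch of combining the two, take the periodic-data case mentioned above: domain knowledge suggests a Fourier basis, and a held-out testing set checks that suggestion against a polynomial. Everything specific here is an assumption for illustration: the synthetic signal, the known period of 2π, the helper `fourier_features`, the 5th-degree polynomial, and a testing set that lies beyond the training range (standing in for "production" data that arrives later).

```python
import numpy as np

def fourier_features(x, n_harmonics=1, period=2 * np.pi):
    """Design matrix with columns [1, sin(k*w*x), cos(k*w*x)] for k = 1..n_harmonics."""
    omega = 2 * np.pi / period
    cols = [np.ones_like(x)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(k * omega * x))
        cols.append(np.cos(k * omega * x))
    return np.column_stack(cols)

def signal(x):
    # Hypothetical periodic ground truth with period 2*pi.
    return 2.0 * np.sin(x) + 0.5 * np.cos(x)

rng = np.random.default_rng(0)
x_train = np.linspace(0, 4 * np.pi, 150)      # two periods for fitting
x_test = np.linspace(4 * np.pi, 6 * np.pi, 75)  # the following period, held out
y_train = signal(x_train) + rng.normal(scale=0.3, size=x_train.size)
y_test = signal(x_test) + rng.normal(scale=0.3, size=x_test.size)

# Candidate 1: 5th-degree polynomial fitted by least squares.
poly_coeffs = np.polyfit(x_train, y_train, 5)
poly_test_mse = float(np.mean((y_test - np.polyval(poly_coeffs, x_test)) ** 2))

# Candidate 2: Fourier basis suggested by domain knowledge, also least squares.
weights, *_ = np.linalg.lstsq(fourier_features(x_train), y_train, rcond=None)
fourier_pred = fourier_features(x_test) @ weights
fourier_test_mse = float(np.mean((y_test - fourier_pred) ** 2))

print(f"polynomial test MSE: {poly_test_mse:.3f}")
print(f"Fourier test MSE:    {fourier_test_mse:.3f}")
```

The polynomial can track the training range reasonably well, but it has no reason to keep oscillating outside it, so the held-out period is where the Fourier model is expected to do better. If the testing set covered only the training range, the gap would likely be much smaller, which is why the testing set should resemble the data the model will actually see.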