Solved – Finding the appropriate polynomial fit in Python

curve fitting, numpy, overfitting, python, regression

Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points?

I am not really interested in the ML use case of generalizing to new data; I am only focusing on the data I have. I realize that the higher the degree, the better the fit. However, I want something that penalizes higher degrees or looks at where the error elbows. By elbowing, I mean something like this (although usually it is not so drastic or obvious):

[Image: error curve with an elbow]

One idea I had was to use NumPy's polyfit (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html) to compute polynomial regressions for a range of degrees. polyfit requires the user to specify the degree of the polynomial, which poses a challenge because I don't have any assumptions or preconceived notions about it. The higher the degree of the fit, the lower the error will be, but eventually it plateaus as in the image above. Therefore, I want to automatically compute the degree at which the error curve elbows: e.g. if E is my error and d is my degree, I want to maximize (E[d+1] - E[d]) - (E[d] - E[d-1]).
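A minimal sketch of this elbow idea, assuming the intended criterion is the discrete second difference of the error, E[d+1] - 2E[d] + E[d-1], and taking the sum of squared residuals as the error. The function name `elbow_degree`, the `max_degree` default, and the synthetic data are illustrative choices, not part of NumPy.

```python
import numpy as np

def elbow_degree(x, y, max_degree=10):
    """Fit degrees 0..max_degree and return (elbow degree, errors per degree)."""
    degrees = np.arange(max_degree + 1)
    errors = np.empty(len(degrees))
    for d in degrees:
        coeffs = np.polyfit(x, y, d)
        residuals = y - np.polyval(coeffs, x)
        errors[d] = np.sum(residuals ** 2)  # sum of squared residuals
    # Second difference: curvature[i] = E[i+2] - 2*E[i+1] + E[i],
    # which is centered on degree i + 1 (so degrees 1..max_degree-1).
    curvature = np.diff(errors, n=2)
    elbow = int(degrees[1:-1][np.argmax(curvature)])
    return elbow, errors

# Illustrative data only: a noisy quadratic on a symmetric grid.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 3 * x**2 - 1 + rng.normal(scale=0.01, size=x.size)

elbow, errors = elbow_degree(x, y, max_degree=6)
print("elbow degree:", elbow)
print("errors:", np.round(errors, 4))
```

This criterion works on the raw error, so it tends to be dominated by the first large drop; applying it to the log of the error is a common variation when the error spans several orders of magnitude.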

Is this even a valid approach? Are there other tools and approaches, perhaps using well-established Python libraries like numpy or scipy, that can help find the appropriate polynomial fit (without the degree being specified)? I would appreciate any thoughts or suggestions! Thanks!

Best Answer

Usually you would not fit polynomial models to your data willy-nilly without a good reason. Assuming that is acceptable to you, here are two options:

  1. Use cross-validation to determine which model (which polynomial degree) is the most appropriate, by optimizing a measure such as accuracy (to be maximized) or RMSE (to be minimized), depending on whether you have a classification or a regression problem.
  2. Use an ANOVA test to compare sequential (nested) models, that is, the models with polynomial degree 1, degree 2, degree 3, and so on, and use the last model that still gives a significant decrease in the sum of squares (SS).
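Both options can be sketched with numpy and scipy. This is a rough illustration under stated assumptions, not a definitive implementation: the function names, the 5 folds, the 0.05 significance level, and the synthetic data are all my own choices, and "last still significant model" is interpreted here as the highest degree whose added term passes a sequential F-test.

```python
import numpy as np
from scipy import stats

def cv_rmse(x, y, degree, n_folds=5, seed=0):
    """Mean k-fold cross-validated RMSE of a polynomial fit of this degree."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    scores = []
    for test_idx in folds:
        train = np.ones(len(x), dtype=bool)
        train[test_idx] = False  # hold this fold out of the fit
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[test_idx])
        scores.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return float(np.mean(scores))

def degree_by_cv(x, y, max_degree=8):
    """Degree with the lowest cross-validated RMSE, plus all scores."""
    scores = [cv_rmse(x, y, d) for d in range(max_degree + 1)]
    return int(np.argmin(scores)), scores

def degree_by_f_test(x, y, max_degree=8, alpha=0.05):
    """Highest degree whose extra term is significant versus degree - 1."""
    def rss(d):
        return float(np.sum((y - np.polyval(np.polyfit(x, y, d), x)) ** 2))

    n = len(x)
    best, rss_prev = 0, rss(0)
    for d in range(1, max_degree + 1):
        rss_d = rss(d)
        df_resid = n - (d + 1)  # observations minus fitted coefficients
        # F statistic for adding one term to the nested lower-degree model
        f_stat = (rss_prev - rss_d) / (rss_d / df_resid)
        if stats.f.sf(f_stat, 1, df_resid) < alpha:
            best = d
        rss_prev = rss_d
    return best

# Illustrative data only: a noisy cubic.
rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 80)
y = x**3 - 2 * x + rng.normal(scale=0.3, size=x.size)

print("degree by cross-validation:", degree_by_cv(x, y)[0])
print("degree by sequential F-test:", degree_by_f_test(x, y))
```

Because the F-test is repeated for every degree, an occasional spurious "significant" high-degree term is possible; stopping at the first non-significant degree is a stricter alternative, though it can stop too early when a lower-order term happens to be absent.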