[Math] When fitting a polynomial to data points, how to determine the reasonable degree to use

interpolation, mathematical modeling, regression

I have wondered about the following. Suppose that there is a set of data points $(x_i,y_i)$. I would like to know whether it is more reasonable to assume a polynomial relation of degree $m$ between them or one of degree $n$. Is there a way to measure this? I know that the Lagrange polynomial passes through every data point exactly, but a physics formula such as $F=ma$ shows that sometimes a linear polynomial is the correct model for the phenomenon.

Best Answer

Part of the issue is whether you want your function to fit the data "as closely as possible", or if you want it to hit every data point exactly.

For example, if you want to fit some data that appears linear, linear least squares approximation, which finds the two coefficients that minimize the error, is the right way to go. However, if you want a curve that passes through every data point exactly, you might want to look at Lagrange interpolation.

It sounds like you want a fit that is "as close as possible", but you want to compare the accuracy of polynomials of different degrees. You can use least squares techniques to find the coefficients of a polynomial of a given degree. To do this, you use a matrix containing powers of the $x$-values of your data points and a vector containing the coefficients.

Say we have $d$ data points $(x_i,y_i)$, and we want a polynomial of degree $n$. Then our matrix will have $d$ rows and $n+1$ columns. The $i$th row contains the powers $x_i^0, x_i^1, \dots, x_i^n$ of the $i$th $x$-value. The vector contains the constant term, then the linear coefficient, and so on up to the coefficient of $x^n$.
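Written out, the setup described above looks roughly as follows. The symbols $A$, $c$ and $y$ are labels assumed here for illustration; they do not appear in the original answer.

$$
A = \begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^n \\
1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_d & x_d^2 & \cdots & x_d^n
\end{pmatrix},
\qquad
c = \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{pmatrix},
\qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{pmatrix}
$$

With this layout, the $i$th entry of $Ac$ is $c_0 + c_1 x_i + \dots + c_n x_i^n$, which is the value of the polynomial at $x_i$. A matrix of this form is usually called a Vandermonde matrix.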

Multiplying the matrix by the coefficient vector gives you a vector of dimension $d$, independent of the degree of the polynomial used; its entries are the values of the polynomial at the data points. Least squares uses these objects to find the coefficients that minimize the error. Once you have the best coefficients for a given degree, you can multiply the matrix by the coefficient vector and then subtract the vector containing the $y$-values. The norm of the resulting vector, (X Powers)*(Coefs) - (Y data), is the square root of the sum of the squares of the errors at the data points.

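If you compute this norm for several different degrees, the degree with the smallest norm gives the closest approximation among the degrees tested. Keep in mind that the norm can never increase as the degree goes up, since a higher-degree polynomial can always reproduce a lower-degree one, so the informative thing to look for is the degree beyond which the error stops dropping noticeably.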
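The comparison across degrees can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fit_polynomial` and `residual_norm` and the sample data are made up for this example, the least squares problem is solved through the normal equations with plain Gaussian elimination (reasonable for small degrees, but poorly conditioned for large ones), and the data must contain at least `degree + 1` distinct $x$-values. In practice a library routine such as NumPy's `numpy.linalg.lstsq` would normally replace the hand-written solve.

```python
import math


def fit_polynomial(xs, ys, degree):
    """Return least-squares coefficients [c0, c1, ..., c_degree].

    Hypothetical helper: solves the normal equations (A^T A) c = A^T y.
    """
    # One row per data point, holding the powers 0..degree of x_i.
    A = [[x ** k for k in range(degree + 1)] for x in xs]
    m = degree + 1
    # Augmented matrix [A^T A | A^T y].
    M = [
        [sum(row[i] * row[j] for row in A) for j in range(m)]
        + [sum(row[i] * y for row, y in zip(A, ys))]
        for i in range(m)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution.
    coefs = [0.0] * m
    for i in reversed(range(m)):
        s = sum(M[i][j] * coefs[j] for j in range(i + 1, m))
        coefs[i] = (M[i][m] - s) / M[i][i]
    return coefs


def residual_norm(xs, ys, coefs):
    """Euclidean norm of (matrix * coefs) - (y data)."""
    return math.sqrt(sum(
        (sum(c * x ** k for k, c in enumerate(coefs)) - y) ** 2
        for x, y in zip(xs, ys)
    ))


# Made-up sample data: roughly y = 1 + 2x + 0.5x^2 with small perturbations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 3.4, 7.1, 11.4, 17.0, 23.6, 30.9]

# Residual norm of the best-fit polynomial for each tested degree.
norms = {n: residual_norm(xs, ys, fit_polynomial(xs, ys, n)) for n in range(1, 5)}
for n, r in norms.items():
    print(f"degree {n}: residual norm {r:.4f}")
```

Running this prints one residual norm per tested degree, which can then be compared side by side.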

Best of luck!