Solved – Selecting best model based on linear, quadratic and cubic fit of data

least-squares, model-selection, regression

I have Java code that performs regression on a set of data using Gauss-Jordan elimination. It fits linear, quadratic, and cubic functions using the least-squares method.

My problem is choosing, out of the three, the function that best models my data. This is, of course, without plotting the curve. So let's say I have the data set {x: 1, 2, 3, 4}, {y: 3, 8, 9, 10}, and I compute a linear, a quadratic, and a cubic function for it using the least-squares method.

How do I choose which function is the right model for my data?

Best Answer

The general term for what you are asking about is model selection. You have a set of possible models, in this case something like $$ \begin{aligned} y&=\beta_1x + \beta_0\\ y&=\beta_2x^2 + \beta_1x + \beta_0 \\ y&=\beta_3x^3 + \beta_2x^2 + \beta_1x + \beta_0 \\ \end{aligned}$$ and you want to determine which of these models is most parsimonious with your data. We generally worry about parsimony rather than best fit (i.e., highest $R^2$) because a complex model can "over-fit" the data. For example, imagine your timing data is generated by a quadratic algorithm, but there's a little bit of noise in the timing (random paging by the OS, clock inaccuracy, cosmic rays, whatever). The quadratic model might still fit reasonably well, but it won't be perfect. However, we can always find a (very high order) polynomial that goes through each and every data point. That model fits perfectly but will be terrible at making future predictions and, obviously, doesn't match the underlying phenomenon either. We want to balance model complexity against the model's explanatory power. How does one do this?
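To make the over-fitting point concrete: a degree-$(n-1)$ polynomial can always be made to pass exactly through $n$ data points (e.g., via Lagrange interpolation), so a "perfect" fit is, by itself, no evidence of a good model. A minimal sketch (class and method names are mine, not from the question's code):

```java
// Lagrange interpolation: builds the unique degree-(n-1) polynomial
// through n points with distinct x-values and evaluates it at `at`.
public class Lagrange {
    public static double interpolate(double[] x, double[] y, double at) {
        double result = 0.0;
        for (int i = 0; i < x.length; i++) {
            // Basis polynomial for point i: 1 at x[i], 0 at every other x[j].
            double term = y[i];
            for (int j = 0; j < x.length; j++) {
                if (j != i) term *= (at - x[j]) / (x[i] - x[j]);
            }
            result += term;
        }
        return result;
    }
}
```

By construction this reproduces every training point exactly (zero residual), yet between and beyond the data points it can oscillate wildly, which is exactly the over-fitting failure mode described above.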

There are many options. I recently stumbled upon this review by Zucchini, which might be a good overview. One approach is to calculate something like the AIC (Akaike Information Criterion), which adjusts each model's likelihood to account for the number of parameters. These criteria are often easy to compute. For example, the AIC is $$ \mathrm{AIC} = 2k - 2\ln(L) $$ where $L$ is the likelihood of the data given the model and $k$ is the number of parameters (e.g., 2 for linear, 3 for quadratic, etc.). You compute this criterion for each model, then choose the model with the smallest AIC.
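For ordinary least squares with Gaussian errors, $-2\ln(L)$ reduces (up to an additive constant that is identical across the three models, so it cancels when comparing them) to $n\ln(\mathrm{RSS}/n)$, which means the AIC can be computed directly from the residuals your fit already produces. A sketch in Java (class and method names are hypothetical, not from the asker's code):

```java
// Sketch: AIC for least-squares fits, assuming Gaussian errors.
// For least squares, -2 ln(L) = n * ln(RSS / n) + constant, so comparing
// 2k + n * ln(RSS / n) across models is equivalent to comparing AICs.
public class Aic {
    /** Residual sum of squares from a fit's residuals. */
    public static double rss(double[] residuals) {
        double sum = 0.0;
        for (double r : residuals) sum += r * r;
        return sum;
    }

    /** AIC (up to a model-independent constant) for a least-squares fit
        with residual sum of squares rss, n data points, and k parameters. */
    public static double aic(double rss, int n, int k) {
        return 2.0 * k + n * Math.log(rss / n);
    }
}
```

Usage: compute `aic(...)` for the linear (k = 2), quadratic (k = 3), and cubic (k = 4) fits and keep the smallest; note that each extra coefficient costs a fixed penalty of 2, so a bigger model only wins if it reduces the RSS enough to pay for it.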

Another approach is to use cross-validation (or a similar resampling method) to check that none of your models is over-fit. You could then select the best-fitting model.
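With only a handful of points, leave-one-out cross-validation is a natural choice: fit each candidate model on all but one point, predict the held-out point, and compare the models by mean squared prediction error. A sketch in Java, where the `Fitter` interface stands in for the asker's Gauss-Jordan least-squares routine (all names here are hypothetical):

```java
import java.util.function.DoubleUnaryOperator;

public class Loocv {
    /** Stand-in for a fitting routine: given training data, return a
        predictor. Your Gauss-Jordan least-squares fit would plug in here. */
    public interface Fitter {
        DoubleUnaryOperator fit(double[] x, double[] y);
    }

    /** Leave-one-out CV: hold out each point in turn, fit on the rest,
        and return the mean squared error of predicting the held-out point. */
    public static double score(double[] x, double[] y, Fitter fitter) {
        int n = x.length;
        double sse = 0.0;
        for (int i = 0; i < n; i++) {
            double[] xt = new double[n - 1], yt = new double[n - 1];
            for (int j = 0, k = 0; j < n; j++) {
                if (j == i) continue;   // leave point i out of the training set
                xt[k] = x[j];
                yt[k] = y[j];
                k++;
            }
            double err = y[i] - fitter.fit(xt, yt).applyAsDouble(x[i]);
            sse += err * err;
        }
        return sse / n;
    }
}
```

You would call `score` once per candidate degree and pick the model with the lowest held-out error; an over-fit model fits its training points well but predicts the held-out points badly, so this penalizes complexity automatically.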

That's sort of the general case. However, as @Michelle noted above, you probably don't want to be doing model selection at all if you know something about the underlying phenomenon. In this case, if you have the code or know the underlying algorithm, you should just trace through it to determine the algorithm's order.

Also, keep in mind that the Big-O order of an algorithm isn't technically defined in terms of the best fit to observed run times; it's a limiting property. You could feasibly have an algorithm whose runtime has a huge linear component and a tiny quadratic component, something like $$t(n) = 0.0000001n^2 + 999999999n.$$ I would bet that a runtime-vs-input-size plot for that would look pretty linear over the ranges you're likely to test, but the algorithm would still technically be $O(n^2)$.
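To see why, note that the quadratic term only overtakes the linear one where $0.0000001\,n = 999999999$, i.e. around $n \approx 10^{16}$, far beyond any input size you could realistically benchmark. A quick check, using the constants from the example above:

```java
// Where does a*n^2 overtake b*n?  They are equal when a*n = b, i.e. n = b/a.
public class Crossover {
    public static double crossover(double a, double b) {
        return b / a;
    }

    public static void main(String[] args) {
        // Constants from t(n) = 0.0000001 n^2 + 999999999 n.
        double n = crossover(0.0000001, 999999999.0);
        // n is on the order of 1e16.
        System.out.println("quadratic term dominates beyond n = " + n);
    }
}
```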