Solved – Determining best fitting curve fitting function out of linear, exponential, and logarithmic functions

Tags: curve fitting, model selection, predictive-models, regression

Context:

From a question on Mathematics Stack Exchange (Can I build a program): someone has a set of $(x, y)$ points and wants to fit a curve to them, either linear, exponential, or logarithmic.
The usual method is to start by choosing one of these (which specifies the model), and then do the statistical calculations.

But what is really wanted is to find the 'best' curve out of linear, exponential or logarithmic.

Ostensibly, one could try all three, and choose the best fitted curve of the three according to the best correlation coefficient.

But somehow I feel this is not quite kosher. The generally accepted method is to pick your model first, one of those three (or some other link function), and then calculate the coefficients from the data. Picking the best of all three post facto looks like cherry-picking. But to me, whether you are determining a function or coefficients from the data, it is still the same thing: your procedure is discovering the best … thing (let's say that the choice of function is just another coefficient to be discovered).

Questions:

Is it appropriate to choose the best fitting model out of linear, exponential, and logarithmic models, based on a comparison of fit statistics?
If so, what is the most appropriate way to do this?
If regression helps find parameters (coefficients) in a function, why can't there be a discrete parameter to choose which of three curve families the best would come from?

Best Answer

You might want to check out the free software called Eureqa. It has the specific aim of automating the process of finding both the functional form and the parameters of a given functional relationship.
If you are comparing models with different numbers of parameters, you will generally want to use a measure of fit that penalises models with more parameters. There is a rich literature on which fit measure is most appropriate for model comparison, and the issues get more complicated when the models are not nested. I'd be interested to hear what others think is the most suitable model comparison index given your scenario (as a side point, there was recently a discussion on my blog about model comparison indices in the context of comparing models for curve fitting).
In my experience, non-linear regression models are used for reasons beyond pure statistical fit to the given data:
1. Non-linear models make more plausible predictions outside the range of the data
2. Non-linear models require fewer parameters for equivalent fit
3. Non-linear regression models are often applied in domains where there is substantial prior research and theory guiding model selection.
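
As a rough illustration of the "try all three and compare" idea from the question, the following sketch fits the three families with `scipy.optimize.curve_fit` and scores each with a Gaussian-error form of AIC. The synthetic data, the starting values `p0`, and the helper `gaussian_aic` are assumptions made for the example, not part of the original question or answer.

```python
import numpy as np
from scipy.optimize import curve_fit

# Three candidate families, each with two free parameters.
def linear(x, a, b):
    return a * x + b

def exponential(x, a, b):
    return a * np.exp(b * x)

def logarithmic(x, a, b):
    return a * np.log(x) + b

def gaussian_aic(y, y_hat, k):
    """AIC up to an additive constant, assuming i.i.d. Gaussian errors."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return n * np.log(rss / n) + 2 * k

# Hypothetical data: a logarithmic trend plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 40)
y = 3.0 * np.log(x) + 2.0 + rng.normal(scale=0.1, size=x.size)

candidates = {
    "linear": (linear, (1.0, 0.0)),
    "exponential": (exponential, (2.0, 0.1)),
    "logarithmic": (logarithmic, (1.0, 0.0)),
}

scores = {}
for name, (func, p0) in candidates.items():
    popt, _ = curve_fit(func, x, y, p0=p0, maxfev=10000)
    scores[name] = gaussian_aic(y, func(x, *popt), k=len(popt))

best = min(scores, key=scores.get)  # lowest AIC wins
print(scores, best)
```

Because all three families here have the same number of parameters, the AIC penalty is identical for each and the ranking reduces to comparing residual sums of squares. The penalty only starts to matter once the candidates differ in complexity.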

Related Solutions

Solved – Selecting best model based on linear, quadratic and cubic fit of data

The general term for what you are asking about is model selection. You have a set of possible models, in this case something like
$$ \begin{aligned} y&=\beta_1x + \beta_0\\ y&=\beta_2x^2 + \beta_1x + \beta_0 \\ y&=\beta_3x^3 + \beta_2x^2 + \beta_1x + \beta_0 \end{aligned} $$
and you want to determine which of these models is most parsimonious with your data. We generally worry about parsimony rather than best fit (i.e., highest $R^2$), since a complex model could "over-fit" the data. For example, imagine your timing data is generated by a quadratic algorithm, but there is a little bit of noise in the timing (random paging by the OS, clock inaccuracy, cosmic rays, whatever). The quadratic model might still fit reasonably well, but it won't be perfect. However, we can find a (very high order) polynomial that goes through each and every data point. This model fits perfectly but will be terrible at making future predictions and, obviously, doesn't match the underlying phenomenon either. We want to balance model complexity against the model's explanatory power. How does one do this?

There are many options. I recently stumbled upon this review by Zucchini, which might be a good overview. One approach is to calculate something like the AIC (Akaike Information Criterion), which adjusts each model's likelihood to take the number of parameters into account. These criteria are often relatively easy to compute. For example, AIC is: $$ AIC = 2k - 2\ln(L) $$ where $L$ is the maximized likelihood of the data given the model and $k$ is the number of parameters (e.g., 2 for linear, 3 for quadratic, etc.). You compute this criterion for each model, then choose the model with the smallest AIC.
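
A minimal sketch of that recipe for the three polynomial models, assuming i.i.d. Gaussian errors, in which case $-2\ln(L)$ equals $n\ln(RSS/n)$ up to a constant that is the same for every model. The timing data below is simulated and the helper name `gaussian_aic` is made up for the example.

```python
import numpy as np

def gaussian_aic(y, y_hat, k):
    """AIC = 2k - 2 ln(L), with ln(L) written via the residual sum of squares.

    Assumes i.i.d. Gaussian errors; constants shared by all models are dropped.
    """
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return 2 * k + n * np.log(rss / n)

# Simulated "timing" data from a quadratic process with noise (an assumption).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 60)
y = 0.5 * x**2 - 2.0 * x + 3.0 + rng.normal(scale=1.0, size=x.size)

aic = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    aic[degree] = gaussian_aic(y, y_hat, k=degree + 1)

best_degree = min(aic, key=aic.get)
print(aic, best_degree)
```

The linear model should score far worse here. The quadratic and cubic scores will usually be close, since the cubic term buys only a small reduction in residuals in exchange for its extra penalty of 2.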

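Another approach is to use cross-validation (or something like it): fit each model on part of the data and measure its error on the part that was held out. A model that is over-fit will do well on the data it was trained on but poorly on the held-out data, so you can then select the model with the best held-out fit.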
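
One way this could look in practice is k-fold cross-validation written directly in NumPy. The function `kfold_mse`, the fold count, and the simulated data are all assumptions chosen for the sketch.

```python
import numpy as np

def kfold_mse(x, y, degree, n_folds=5, seed=0):
    """Mean held-out squared error of a polynomial fit, averaged over folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    folds = np.array_split(indices, n_folds)
    errors = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != i]
        )
        # Fit on the training folds only, then score on the held-out fold.
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# Simulated quadratic data with unit-variance noise (an assumption).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 60)
y = 0.5 * x**2 - 2.0 * x + 3.0 + rng.normal(scale=1.0, size=x.size)

cv_mse = {degree: kfold_mse(x, y, degree) for degree in (1, 2, 3)}
print(cv_mse)
```

Using the same `seed` for every degree means all models are scored on identical splits, which keeps the comparison fair.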

That's sort of the general case. However, as @Michelle noted above, you probably don't want to be doing model selection at all if you know something about the underlying phenomenon. In this case, if you have the code or know the underlying algorithm, you should just trace through it to determine the algorithm's order.

Also, keep in mind that the Big-O order of the algorithm isn't technically defined in terms of the best fit to the observed run time; it's more of a limiting property. You could feasibly have an algorithm with a massive linear component and a small quadratic component to its runtime, something like $$t(n) = 0.0000001n^2 + 999999999n$$ I would bet that a runtime-vs-input-size plot for that would be pretty linear-looking over the ranges you're likely to test, but I believe the algorithm would technically be considered $O(n^2)$.
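
To put numbers on that example, the sketch below computes what share of the runtime comes from the quadratic term and where the two terms become equal. The function names are made up for illustration.

```python
def runtime(n):
    """The example runtime: tiny quadratic term plus a huge linear term."""
    return 1e-7 * n**2 + 999_999_999 * n

def quadratic_share(n):
    """Fraction of the total runtime contributed by the quadratic term."""
    return (1e-7 * n**2) / runtime(n)

# The two terms are equal when 1e-7 * n**2 == 999_999_999 * n.
crossover = 999_999_999 / 1e-7  # roughly 1e16

for n in (10**3, 10**6, 10**9):
    print(n, quadratic_share(n))
print(crossover, quadratic_share(crossover))
```

For any input size below roughly $10^{16}$ the linear term dominates, which is why a fit to measured runtimes would favour the linear model even though the asymptotic order is quadratic.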

Solved – How to interpret the covariance matrix from a curve fit

As a clarification, the variable pcov from scipy.optimize.curve_fit is the estimated covariance of the parameter estimates. Loosely speaking, it tells you how much information the data contains for determining the value of each parameter, given the data and the chosen model. So it does not really tell you whether the chosen model is good or not. See also this.

The question of what makes a good model is indeed a hard one. As statisticians like to say:

All models are wrong, but some are useful

So the criteria to use when comparing different models depend on what you want to achieve.

For instance, if you want a curve that is as "close as possible" to the data, you could select the model that gives the smallest residual. In your case, it would be the model func and estimated parameters popt that give the lowest value when computing

numpy.linalg.norm(y-func(x, *popt))

However, if you select a model with more parameters, the residual will generally decrease, at the cost of higher model complexity. So it comes back to what the goal of the model is.
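
For concreteness, here is a small sketch of how pcov and the residual norm might be inspected side by side. The model `func`, the simulated data, and the starting values are assumptions for the example. Taking the square root of the diagonal of pcov is the usual way to get one-standard-deviation errors on the parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    """Hypothetical model: exponential growth with scale a and rate b."""
    return a * np.exp(b * x)

# Simulated data from the model itself plus Gaussian noise (an assumption).
rng = np.random.default_rng(3)
x = np.linspace(0.0, 2.0, 50)
y = func(x, 2.0, 1.3) + rng.normal(scale=0.1, size=x.size)

popt, pcov = curve_fit(func, x, y, p0=(1.0, 1.0))

# Parameter uncertainty: standard errors from the covariance diagonal.
perr = np.sqrt(np.diag(pcov))

# Goodness of fit: size of the residual vector, as in the text above.
residual_norm = np.linalg.norm(y - func(x, *popt))

print("popt:", popt)
print("perr:", perr)
print("residual norm:", residual_norm)
```

The two quantities answer different questions: perr says how tightly the data pin down the parameters within this model, while the residual norm says how far the fitted curve is from the data.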