Solved – Determining the best-fitting curve out of linear, exponential, and logarithmic functions

curve fitting, model selection, predictive-models, regression

Context:

From a question on Mathematics Stack Exchange (Can I build a program): someone has a set of $(x, y)$ points and wants to fit a curve to them, either linear, exponential, or logarithmic.
The usual method is to start by choosing one of these families (which specifies the model) and then do the statistical calculations to estimate its coefficients.

But what is really wanted is to find the 'best' curve out of linear, exponential or logarithmic.

Ostensibly, one could try all three, and choose the best fitted curve of the three according to the best correlation coefficient.
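
The "try all three" procedure can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic and noise-free, the helper names (`linfit`, `fit_candidates`, `r_squared`) are hypothetical, and the exponential and logarithmic curves are fitted by linearising (regressing $\ln y$ on $x$, and $y$ on $\ln x$), which requires $y > 0$ and $x > 0$ respectively. The fit statistic is $R^2$ computed on the original $y$ scale for every model so that the three values are comparable.

```python
import math

def linfit(u, v):
    """Ordinary least squares for v = a + b*u; returns (a, b)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((ui - mu) ** 2 for ui in u)
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    b = suv / suu
    return mv - b * mu, b

def fit_candidates(x, y):
    """Fit the three families; return {name: prediction function}."""
    a1, b1 = linfit(x, y)                            # y = a + b*x
    a2, b2 = linfit([math.log(t) for t in x], y)     # y = a + b*ln(x), needs x > 0
    a3, b3 = linfit(x, [math.log(v) for v in y])     # ln(y) = a + b*x, needs y > 0
    return {
        "linear": lambda t: a1 + b1 * t,
        "logarithmic": lambda t: a2 + b2 * math.log(t),
        "exponential": lambda t: math.exp(a3 + b3 * t),
    }

def r_squared(f, x, y):
    """Coefficient of determination of f on the original y scale."""
    my = sum(y) / len(y)
    ss_res = sum((yi - f(xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic, noise-free example data (an assumption for illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 * math.exp(0.5 * t) for t in x]

models = fit_candidates(x, y)
r2 = {name: r_squared(f, x, y) for name, f in models.items()}
best = max(r2, key=r2.get)   # the "discrete parameter": which family wins
print(r2, best)
```

Note that the linearised exponential fit minimises error in $\ln y$ rather than in $y$, so it is not the least-squares exponential curve on the original scale; a direct non-linear fit would be needed for that.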

But somehow I feel this is not quite kosher. The generally accepted method is to pick your model first, one of those three (or some other link function), and then calculate the coefficients from the data. Picking the best of all three post facto looks like cherry picking. But to me, whether you are determining a function or coefficients from the data, it is still the same thing: your procedure is discovering the best…thing (let's say that which function to use is just another coefficient to be discovered).

Questions:

  • Is it appropriate to choose the best fitting model out of linear, exponential, and logarithmic models, based on a comparison of fit statistics?
  • If so, what is the most appropriate way to do this?
  • If regression helps find parameters (coefficients) in a function, why can't there be a discrete parameter to choose which of three curve families the best would come from?

Best Answer

  • You might want to check out the free software called Eureqa. It has the specific aim of automating the process of finding both the functional form and the parameters of a given functional relationship.
  • If you are comparing models with different numbers of parameters, you will generally want to use a measure of fit that penalises models with more parameters. There is a rich literature on which fit measure is most appropriate for model comparison, and the issues get more complicated when the models are not nested. I'd be interested to hear what others think is the most suitable model comparison index given your scenario (as a side point, there was recently a discussion on my blog about model comparison indices in the context of comparing models for curve fitting).
  • In my experience, non-linear regression models are used for reasons beyond pure statistical fit to the given data:
    1. Non-linear models can make more plausible predictions outside the range of the data
    2. Non-linear models often require fewer parameters for an equivalent fit
    3. Non-linear regression models are often applied in domains where there is substantial prior research and theory guiding model selection.
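
As one concrete instance of a fit measure that penalises extra parameters, consider the Akaike information criterion (AIC). The choice of AIC is an illustration, not something the answer prescribes. Assuming independent Gaussian errors with constant variance, and a model with $k$ estimated parameters fitted by least squares to $n$ points, it can be written up to an additive constant that depends only on $n$:

$$
\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k,
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 .
$$

The model with the lowest AIC is preferred. Two caveats apply to the scenario in the question. First, the linear, exponential, and logarithmic families each have the same number of parameters, so the $2k$ term is identical for all three and the ranking reduces to comparing $\mathrm{RSS}$. Second, the comparison is only meaningful if $\mathrm{RSS}$ is computed on the same response scale for every model: residuals of $\ln y$ from a linearised exponential fit are not comparable with residuals of $y$ from a linear fit.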