Solved – Fitting a power law to the relationship between two variables


I have a small set of data that I need to fit a curve to (see image and data below).

[plot of the data below]

algaeConc    adultDens
2.934961    6665.145469
2.921015    6665.144453
2.910769    6665.142377
2.908331    6665.138925
2.907658    6665.134515
2.907278    6665.128736
2.907130    6665.121034
2.907160    6665.117992
2.907326    6665.124547
2.907594    6665.131266

The data came out of a simulation model in which I was stressing a parameter (filtration rate) and recording what happened to my variables (algae concentration and adult density). Now I need to fit a curve to this, and I have to admit I have never seen this kind of decline before. So I would like some suggestions on how to fit a curve to it. It seems to me that a power law might be needed, but I have never worked with one before and am not sure how to go about it.

I've spent all day reading about power laws and trying to fit one in R, but I only managed to confuse myself, so any help will be highly appreciated.

Edit: The x-axis is inverted to match the stress on the filtration parameter (increasing stress caused a decrease in algae concentration).

Best Answer

Here's the unflipped version of your plot:

[plot of the data from the question]

No choice of $a$ and $b$ in $y=ax^{-b}$ (where $y$ is adult density and $x$ is algae concentration) will fit that shape of relationship; such curves don't increase and flatten off toward a positive asymptote:

[plot of various power-law curves for positive $a$]

(With negative $a$ you can get a curve that increases toward an asymptote, but the fitted values will all be negative.)
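For intuition, here's a small R sketch of the shapes $y = ax^{-b}$ takes for a few positive values of $a$ and $b$ (the particular values are arbitrary, chosen purely to show the shape):

```r
# shapes of y = a * x^(-b) for a few positive a and b:
# all of these decrease toward zero; none increase toward a positive asymptote
curve(5 * x^(-0.5), from = 0.5, to = 5, ylim = c(0, 10), xlab = "x", ylab = "y")
curve(5 * x^(-1), add = TRUE, lty = 2)
curve(2 * x^(-2), add = TRUE, lty = 3)
legend("topright", lty = 1:3, legend = c("a=5, b=0.5", "a=5, b=1", "a=2, b=2"))
```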

Indeed, the range of your $x$-variable is so narrow (relative to its mean, say) that a typical power curve would look effectively like a straight line over it.


A quick power-law diagnostic plot:

If you don't have intuition for power laws, a simple way to see whether a power law is likely to be a reasonable description is to plot $\log y$ against $\log x$ (the base of the logarithm doesn't matter). If a power law is suitable, the relationship should look close to linear on that scale.

In your case we can see immediately that this won't happen (indeed, the coefficients of variation are so small that the log-scale plot looks much like the first plot above, aside from the axis scales).
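As a minimal sketch of that diagnostic in R, using the data from the question (the data-frame name `dat` is just an assumption for illustration):

```r
# data from the question, in an (assumed) data frame called dat
dat <- data.frame(
  algaeConc = c(2.934961, 2.921015, 2.910769, 2.908331, 2.907658,
                2.907278, 2.907130, 2.907160, 2.907326, 2.907594),
  adultDens = c(6665.145469, 6665.144453, 6665.142377, 6665.138925, 6665.134515,
                6665.128736, 6665.121034, 6665.117992, 6665.124547, 6665.131266)
)

# log-log diagnostic: if a power law were appropriate, this would look roughly linear
plot(log(dat$algaeConc), log(dat$adultDens),
     xlab = "log(algae concentration)", ylab = "log(adult density)")
```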

There are a number of nonlinear functions used for the kind of "saturation" relationship we see in the plot of the data (dozens of them, really). It's better to proceed from theoretical notions of what is happening (which might lead to a differential equation, say, and then to a functional form) than to guess at functional forms; indeed, that sort of theoretical understanding is where most of the common increasing-to-an-asymptote models arise.


Here are a few examples of nonlinear models that increase to an asymptote (at least for some parameter values); most of them will fit your data poorly (see the R sketch after the equations for the mechanics of fitting one of them):

\begin{eqnarray}
E(y) &=& ax/(b+x) \\
E(y) &=& a(1-\exp(-bx)) \\
E(y) &=& a(1 - \exp(-b(x - c))) \\
E(y) &=& a+(c-a)\exp(-bx) \\
E(y) &=& a - c(x-d)^{-b} \\
E(y) &=& a \exp(-b/x)
\end{eqnarray}
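Purely to illustrate the mechanics (not as a recommendation of this particular model), here is a sketch of fitting the first form, $E(y)=ax/(b+x)$, with nls in R. The starting values are guesses, and with data as flat and narrow-ranged as yours the parameters may be poorly identified, so nls may struggle to converge:

```r
# sketch only: nonlinear least squares for E(y) = a*x/(b + x)
# starting values are rough guesses; adjust to your data
fit <- nls(adultDens ~ a * algaeConc / (b + algaeConc),
           data  = dat,
           start = list(a = 6700, b = 0.01))
summary(fit)

# overlay the fitted curve on the raw data
plot(dat$algaeConc, dat$adultDens,
     xlab = "algae concentration", ylab = "adult density")
xs <- seq(min(dat$algaeConc), max(dat$algaeConc), length.out = 100)
lines(xs, predict(fit, newdata = data.frame(algaeConc = xs)))
```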

There are many others besides. You should be looking to subject-area knowledge to develop a model, though -- trawling through such a list of models to find one that fits effectively ruins any statistical inference on the same data, so unless you want to go and collect another set of data (or indeed possibly several more), don't do that. Strictly speaking, you should have a model in place before you even collect data.


Indeed, there are further consequences of using data-generated models / data-generated hypotheses that I haven't touched on here; there are a number of posts on the site that discuss the effects of this kind of approach to choosing models.

In a situation where there's simply no subject-area knowledge to work from, and where the experimenter can collect any amount of data easily, such curve-fitting strategies (trawling through a list) may have some value in helping to come up with plausible models -- as long as we take great care to make sure that the choice isn't just an artifact of our one set of data, and take similarly great care not to let the model-selection process affect our ultimate estimation and inference.

For example, we might collect enough data that we can do out-of-sample testing (i.e. we can see whether the model performs well on data we didn't use to pick it), and other people might repeat the experiment (so we can see whether the same curve tends to be picked across multiple independently run experiments -- that is, whether it's robust to the small changes in setup that naturally occur when a different lab, with its own equipment and its own people, runs the experiment).
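As a rough sketch of the out-of-sample idea in R, assuming a larger simulated data set `sim` with the same two columns (the name, the split proportion, and the model form are all illustrative assumptions):

```r
# illustrative holdout check: fit on part of the data, assess on the rest
set.seed(1)
n     <- nrow(sim)
train <- sample(n, size = floor(0.7 * n))   # arbitrary 70/30 split

fit_tr <- nls(adultDens ~ a * algaeConc / (b + algaeConc),
              data  = sim[train, ],
              start = list(a = 6700, b = 0.01))

# how well does the chosen curve predict data it never saw?
pred <- predict(fit_tr, newdata = sim[-train, ])
sqrt(mean((sim$adultDens[-train] - pred)^2))   # out-of-sample RMSE
```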

Once that's all done and the model can actually be seen to "work" in some sense, we can then sample an entirely new data set on which to perform estimation and inference (again, where possible, using out-of-sample assessment to avoid over-reliance on asymptotic approximations and on assuming the model is actually the data-generating process when computing things like intervals and standard errors). If we can't do something like that, such list-trawling can be quite dangerous, leading to models that tell us something about the particulars of our data and our modelling process but little of value about the thing we want to model.