Solved – Double exponential fit in R

curve fittingr

I am not yet very good friends with R, but I hope to be! Right now I'm having a lot of troubles with fitting a model to a set of data.

My data looks like:

data <- as.data.frame(cbind(c(0,5,10,15,30,50,60,90,120,180,240),c(0.48, 1.15, 1.03, 1.37, 5.55, 16.77, 20.97, 21.67, 10.50, 2.28, 1.58)))
colnames(data) <- c("times","RNA")

The column RNA is dependent on the time.

I have been trying to fit a double exponential curve to the data, but I don't think it is working so well. This is the code for the fitting:

library(minpack.lm)
fit <- nlsLM(RNA~z*(exp(-v*times)-exp(-u*times)),data=test_2,start=list(u=0.67,v=0.018,z=0.98))

plot(data,pch=19)
lines(data[1],fitted(fit),col=2)

which produces this:
plot

I am not very happy with the fit, since it doesn't really include the "top" of the data (max?). Do any of you have any suggestions about what to do next? I have no clue about what to try else, but maybe another exponential?

Do you have some suggested reading for this, since I'm not looking for a direct answer but more for some ideas…if it's possible!

Best Answer

The functional form of your double exponential curve doesn't capture the shape of your data. Notice that the best fit that you obtained is concave down at $t=0$, where you might expect a gentle slope and positive second derivative.

The "double exponential" functional form is usually associated with the Laplace distribution: $$f(t):= a\exp\left(-\frac{|t-c|}b\right) $$ where $a$ measures the height, $b$ the 'slope', and $c$ the location of the peak. Here's an R implementation using the nls function:

test <- data.frame(times=c(0,5,10,15,30,50,60,90,120,180,240), RNA=c(0.48, 1.15, 1.03, 1.37, 5.55, 16.77, 20.97, 21.67, 10.50, 2.28, 1.58))

nonlin <- function(t, a, b, c) { a * (exp(-(abs(t-c) / b))) }
nlsfit <- nls(RNA ~ nonlin(times, a, b, c), data=test, start=list(a=25, b=20, c=75))

with(test, plot(times, RNA, pch=19, ylim=c(0,40)))
tseq <- seq(0,250,.1)
pars <- coef(nlsfit)
print(pars)
lines(tseq, nonlin(tseq, pars[1], pars[2], pars[3]), col=2)

The resulting fit is OK, but it doesn't capture the right skew in your data set, nor does it do a good job near zero:

enter image description here

One way to fix this is to use a different slope for the left half and the right half of the peak, yielding a four-parameter model:

nonlin <- function(t, a, bl, br, c) {
    ifelse(t < c, a * exp((t-c) / bl), a * exp(-(t-c) / br))
}

nlsfit <- nls(RNA ~ nonlin(times, a, bl, br, c), data=test, start=list(a=25, bl=20, br=20, c=75))

With this modification the fit is quite a bit better:

enter image description here

Another approach is to take the log of the time values to remove the skew. In essence you're fitting a double exponential relationship between RNA and log(time):

nonlin <- function(t, a, b, c) { a * (exp(-(abs(log(t)-c) / b))) }
nlsfit <- nls(RNA ~ nonlin(times, a, b, c), data=test, start=list(a=25, b=1, c=4))

With this three parameter model the fit is closer to zero at $t=0$ (perhaps too close to zero), while the right tail has gotten thicker:

enter image description here

Finally, you can fit a lognormal distribution to your data. In this case you are positing a bell shape for RNA vs log(time):

nonlin <- function(t, a, b, c) { a * (exp(-(log(t)-c)^2 / b)) }
nlsfit <- nls(RNA ~ nonlin(times, a, b, c), data=test, start=list(a=25, b=1, c=4))

Here the resulting fit is perhaps overly aggressive: it doesn't become positive until well past $t=0$.

enter image description here

BTW, here is an R implementation of the fit to the Gumbel distribution, which is sometimes known as the double exponential. This is the functional form used in James Phillips' answer, and perhaps what you intended to code up.

nonlin <- function(t, a, b, c) { (a/b) * exp( -(t-c)/b - exp(-(t-c)/b) ) }
nlsfit <- nls(RNA ~ nonlin(times, a, b, c), data=test, start=list(a=1000, b=50, c=75))

enter image description here

Which functional form is the right one? Which one to adopt depends not only on the quality of the fit but also on the mechanism that generated the data. Is there a theoretical/biological reason for any of these functional forms? OTOH, maybe none of these forms is justified, and you just want to describe what you've observed using a small number of parameters.