Solved – Double exponential fit in R

curve fittingr

I am not yet very good friends with R, but I hope to be! Right now I'm having a lot of troubles with fitting a model to a set of data.

My data looks like:

data <- as.data.frame(cbind(c(0,5,10,15,30,50,60,90,120,180,240),c(0.48, 1.15, 1.03, 1.37, 5.55, 16.77, 20.97, 21.67, 10.50, 2.28, 1.58)))
colnames(data) <- c("times","RNA")

The column RNA is dependent on the time.

I have been trying to fit a double exponential curve to the data, but I don't think it is working so well. This is the code for the fitting:

library(minpack.lm)
fit <- nlsLM(RNA~z*(exp(-v*times)-exp(-u*times)),data=test_2,start=list(u=0.67,v=0.018,z=0.98))

plot(data,pch=19)
lines(data[1],fitted(fit),col=2)

which produces this:

I am not very happy with the fit, since it doesn't really include the "top" of the data (max?). Do any of you have any suggestions about what to do next? I have no clue about what to try else, but maybe another exponential?

Do you have some suggested reading for this, since I'm not looking for a direct answer but more for some ideas…if it's possible!

Best Answer

The functional form of your double exponential curve doesn't capture the shape of your data. Notice that the best fit that you obtained is concave down at $t=0$, where you might expect a gentle slope and positive second derivative.

The "double exponential" functional form is usually associated with the Laplace distribution: $$f(t):= a\exp\left(-\frac{|t-c|}b\right) $$ where $a$ measures the height, $b$ the 'slope', and $c$ the location of the peak. Here's an R implementation using the nls function:

test <- data.frame(times=c(0,5,10,15,30,50,60,90,120,180,240), RNA=c(0.48, 1.15, 1.03, 1.37, 5.55, 16.77, 20.97, 21.67, 10.50, 2.28, 1.58))

nonlin <- function(t, a, b, c) { a * (exp(-(abs(t-c) / b))) }
nlsfit <- nls(RNA ~ nonlin(times, a, b, c), data=test, start=list(a=25, b=20, c=75))

with(test, plot(times, RNA, pch=19, ylim=c(0,40)))
tseq <- seq(0,250,.1)
pars <- coef(nlsfit)
print(pars)
lines(tseq, nonlin(tseq, pars[1], pars[2], pars[3]), col=2)

The resulting fit is OK, but it doesn't capture the right skew in your data set, nor does it do a good job near zero:

One way to fix this is to use a different slope for the left half and the right half of the peak, yielding a four-parameter model:

nonlin <- function(t, a, bl, br, c) {
    ifelse(t < c, a * exp((t-c) / bl), a * exp(-(t-c) / br))
}

nlsfit <- nls(RNA ~ nonlin(times, a, bl, br, c), data=test, start=list(a=25, bl=20, br=20, c=75))

With this modification the fit is quite a bit better:

Another approach is to take the log of the time values to remove the skew. In essence you're fitting a double exponential relationship between RNA and log(time):

nonlin <- function(t, a, b, c) { a * (exp(-(abs(log(t)-c) / b))) }
nlsfit <- nls(RNA ~ nonlin(times, a, b, c), data=test, start=list(a=25, b=1, c=4))

With this three parameter model the fit is closer to zero at $t=0$ (perhaps too close to zero), while the right tail has gotten thicker:

Finally, you can fit a lognormal distribution to your data. In this case you are positing a bell shape for RNA vs log(time):

nonlin <- function(t, a, b, c) { a * (exp(-(log(t)-c)^2 / b)) }
nlsfit <- nls(RNA ~ nonlin(times, a, b, c), data=test, start=list(a=25, b=1, c=4))

Here the resulting fit is perhaps overly aggressive: it doesn't become positive until well past $t=0$.

BTW, here is an R implementation of the fit to the Gumbel distribution, which is sometimes known as the double exponential. This is the functional form used in James Phillips' answer, and perhaps what you intended to code up.

nonlin <- function(t, a, b, c) { (a/b) * exp( -(t-c)/b - exp(-(t-c)/b) ) }
nlsfit <- nls(RNA ~ nonlin(times, a, b, c), data=test, start=list(a=1000, b=50, c=75))

Which functional form is the right one? Which one to adopt depends not only on the quality of the fit but also on the mechanism that generated the data. Is there a theoretical/biological reason for any of these functional forms? OTOH, maybe none of these forms is justified, and you just want to describe what you've observed using a small number of parameters.

Related Solutions

Solved – How to simplify a stretched exponential fit

The first thing to do, if possible, is to take care of the heteroscedasticity. Notice how the spread of the residuals consistently increases with the fit: in fact, the spread seems to increase almost quadratically with larger fit.

A standard cure is to return to the original response ($log(xy)$) and apply a strong transformation, such as a logarithm or even a reciprocal: something in that range is suggested by this pattern of heteroscedasticity. Then redo the fitting and recheck the residuals.

It's a good idea to fit lines by eye, using graphs of transformed $xy$ against $z$ (or $\log(z)$. This usually reveals more than any amount of manipulating a regression routine. Once you have a suitable model, you can finally use least squares (or robust regression) to produce a final fit.

In this instance you might also want to explore the relationships among $x$ and $z$ and $y$ and $z$ separately to see whether just one of $x$, $y$ is causing the sudden change in slope between 2.9 and 3.6. The change clearly is not quadratic: both "limbs" of the residual plot are linear. One way to model this change--if it persists after you have dealt with the heteroscedasticity--is with a changepoint model that posits one value of the slope $\beta_1$ for, say, $z \le 3.2$, and a different value for $z \gt 3.2$.

Finally, in Monte-Carlo simulations you have full control and you know exactly the mechanism that produces the responses. It can be useful to subject this to some analysis to find out just what the correct relationship among the triples ought to be.

Solved – Exponential decay function

First, in its current form, this isn't an exponential. Here is the code with some plots:

gf2 <- c(1.35, 1.33776990615813, 1.32552158334356, 1.31325480792751, 
1.30096935107516, 1.28866497856582, 1.27634145060489, 1.26399852162705, 
1.25163594009013, 1.23925344825924, 1.22685078198044, 1.21442767044339,
1.20198383593233, 1.18951899356456, 1.17703285101575, 1.1645251082312,
1.15199545712208, 1.13944358124587, 1.12686915546963, 1.11427184561532,
1.10165130808571, 1.08900718946957, 1.07633912612483, 1.06364674373805,
1.05092965685848, 1.03818746840501, 1.02541976914382, 1.01262613713477,
0.999806137143936, 0.986959320019858, 0.974085222030623, 0.961183364158636,
0.948253251349736, 0.935294371712896, 0.922306195666412, 0.909288175026071,
0.896239742030314, 0.883160308296893, 0.870049263704956, 0.856905975195814,
0.843729785484912, 0.830520011676707, 0.817275943773186, 0.803996843065683,
0.790681940398468, 0.777330434291129, 0.763941488905233, 0.75051423183888,
0.737047751730707, 0.723541095652442, 0.70999326626635, 0.696403218720661,
0.682769857252295, 0.669092031461845, 0.655368532220595, 0.64159808716336,
0.627779355713794, 0.613910923580398, 0.599991296651462, 0.586018894205227,
0.571992041337234, 0.557908960489589, 0.543767761945991, 0.529566433130946,
0.51530282652052, 0.500974645933667, 0.486579430925755, 0.472114538946752,
0.457577124852233, 0.44296411726136, 0.428272191135991, 0.413497735800733,
0.398636817423197, 0.383685134710617, 0.368637966230041, 0.353490107290977,
0.338235793692888, 0.322868608762547, 0.30738136887844, 0.291765980931759,
0.276013262639391, 0.260112712872577, 0.24405221348003, 0.227817635241112,
0.211392306411718, 0.19475627881785, 0.177885285875707, 0.160749213513732,
0.143309764509418, 0.125516708879546, 0.10730146998924, 0.0885651903719612,
0.0691537518211836, 0.048795218179725, 0.0268838502712772)

plot(gf2) #Plot gf2
plot(diff(gf2)) #Plot the change in gf2

enter image description here

Notice that the last change in gf2 (see the above plot for diff(gf2)) on the far right hand side is the biggest change. If you fit gf2 directly (probably with a ratio of polynomials), the biggest residual will probably be in the last data point. So, you might try a fit for the difference in gf2 and then sum it up.

Looking at the second plot for diff(gf2), it looks like a logarithmic plot that was simply flipped. Here's some code and a plot to show that:

gf2_rev <- rev(gf2) #Reverse gf2
plot(diff(gf2_rev)) #Plot it

enter image description here

If you fit that third graph, you'll get a simple equation. From there, you'll have to flip it back to get the second plot, and then accumulate the sum to get the first plot.

The alternative is to fit a ratio of polynomials that will have a hard time with that far right data point.

Edit 1 (01/14/2012) ================================================

After some playing around with the data, I came up with the following.

First, as described above, that far right data point makes this a little harder, so I decided to fit the difference in gf2 (diff(gf2)) and then sum the results up to get gf2. Here are the results (the fit is in red):

ind <- 1:94
#Here's the fit equation    
diff_gf2_trial <- -0.1283/log(3.666e4 + (2.01*(ind^2)) - (575.4*ind))      

plot(diff(gf2), main="Fit the difference so the far right data point works")
lines(diff_gf2_trial, col="red") #Overlay the fit equation in red

gf2_trial <- cumsum(c(gf2[1], diff_gf2_trial)) #Build the cumulative sum
plot(gf2, main="Sum up the fit to get the estimated gf2")
lines(gf2_trial, col="red") #Overlay the cumulative sum in red

enter image description here

Hope that helps.

Edit 2 (01/14/2012) =========================================

The general form of the equation came from plotting and rough fitting the data in various ways. My target was a clean fit with a model that was as simple as possible. I settled on a negative inverted exponential. The curve looked reasonably quadratic or cubic. After some quick fits, it turned out that a quadratic was good enough. Here's the code for that plot (I realize that the above description sounds like a bunch of weasel-words, however it comes from experience and a lot of playing around with the data).

plot(exp(-1/diff(gf2)), main="Plot of exp(ScaleValue/diff(gf2))")
lines(exp(-0.99/diff(gf2)), col="green")
lines(exp(-0.98/diff(gf2)), col="orange")
lines(exp(-0.97/diff(gf2)), col="purple")

enter image description here

At this point, all I have to do is figure out the ScaleValue for exp(ScaleValue/diff(gf2)) and then perform a linear fit for the quadratic terms that fit the above plot. Here's the code:

library(DEoptim) #An optimizer

#Build a function that allows the Scale Value (scaval) to vary and
#return 1 - Adjusted R-Squared value (so it can be minimized).
#In other words, for each "scaval" value that is tried, fit all of
#the polynomial terms.

minfun <- function(scaval) {
  1 - summary(lm(exp(as.numeric(scaval)/diff(gf2)) ~ poly(ind, degree=2, raw=TRUE)))$adj.r.squared
}

ind <- 1:length(diff(gf2)) #Make an index for the difference in gf2 to be used in minfun
junk <- DEoptim(minfun, lower=-1, upper=-0.0001) #Find the Scale
scaval <- as.numeric(junk$optim$bestmem) #Get the answer returned from the optimizer

#Build the resulting model
res <- lm(exp(scaval/diff(gf2)) ~ poly(ind, degree=2, raw=TRUE))
summary(res)

#Here's the fit equation
diff_gf2_new <- scaval/log(res$coefficients[1] + (res$coefficients[2]*ind) + (res$coefficients[3]*(ind^2)))
plot(diff(gf2), main="Fit the difference so the far right data point works") #Plot the difference in gf2
lines(diff_gf2_new, col="blue") #Overlay the fit equation in blue

enter image description here

Notice that the coefficients for diff_gf2_new may be a little different than diff_gf2_trial. That's because I originally used weights to make sure that the far right data point was represented cleanly. It turns out, those weights aren't really needed. Here's the final fit:

summary(res)

Call:
lm(formula = exp(scaval/diff(gf2)) ~ poly(ind, degree = 2, raw = TRUE))

Residuals:
     Min       1Q   Median       3Q      Max 
-117.090  -64.603   -6.481   57.591  292.487 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         2.671e+05  2.443e+01   10935   <2e-16 ***
poly(ind, degree = 2, raw = TRUE)1 -4.865e+03  1.187e+00   -4099   <2e-16 ***
poly(ind, degree = 2, raw = TRUE)2  2.161e+01  1.211e-02    1786   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 77.27 on 91 degrees of freedom
Multiple R-squared:     1,  Adjusted R-squared:     1 
F-statistic: 4.742e+07 on 2 and 91 DF,  p-value: < 2.2e-16

Best Answer

Related Solutions

Solved – How to simplify a stretched exponential fit

Solved – Exponential decay function

Related Question