Solved – Best approach in R for interpolating and curve fitting a tiny dataset

ggplot2rregressionsplines

I have a set of 'activity' values for some enzyme assays I have been doing, that come out of some analysis I've been doing. The problem is, the data is fairly crap, and there aren't many points, but its for project for my MSc so I'm stuck with it. I appreciate that trying to regress 4 datapoints, and especially extrapolate from them is a statistical no-no, but it's what the PhD student who I'm emulating did (albeit with the help of our Uni's Biostats department – a resource I can't really utilise at this point) – and so I would like the results to be directly comparable.

DPConc DPActivity
0      100.000000
83     67.709971
166    6.296231
416    16.546593

I need to extrapolate from this an IC50 value, which is the concentration (x axis) value, for which 50% Activity is seen.

I want (have) to do this by fitting some kind of curve/spline to the points, but I'm at a bit of a loss. The best I've come up with so far is this:

require(splines)
library(splines)

plot(DPConc, DPActivity)
splineDP <- smooth.spline(DPConc, DPActivity, spar=0.45)
lines(splineDP)

Which for various values of spar gives me this graph (excuse its roughness – I'm really at the end of my tether!)

Dipeptide Activity vs. Concentration

It looks as though some kind of exponential decay is going on, but I'm not enough of a mathematician to be able to concoct the required function. I'd rather derive the function from the data, than make the data fit the function if possible (not least because I have other data that doesn't follow the same trend, though I suspect thats experimental error).

I've tried a couple of different approaches from around the web but my deadline is imminent and I really need the simplest solution. If the solution could be implemented with ggplot, that would be a bonus as my other plots so far have been made with it, the above was just a quick mockup.

Best Answer

According to the OP,

The 'scientific logic' of mass action (which is essentially what this is) would point to a linear or exponential relationship, and it's definitely non-linear

We can start by using ggplot to test out the best model. We'll do it purely visually, by plotting the two different options.

First up, we need to get the data in to a data frame. We'll use the usual method:

DP = data.frame(Activity = c(100,67.7,6.29,16.55),
                Conc = c(0, 83, 166, 416))

Now, we'll plot a linear fit to the data using ggplot's geom_smooth() function:

require(ggplot2)

p.dp.1 <- ggplot(data = DP,
               aes(x = Conc,
                   y = Activity)) + 
  ylab("Activity") + 
  xlab("Concentration") +
  geom_point() + 
  theme_bw(base_size = 8) +
  geom_smooth(method = "lm", 
              aes(colour = "Linear")) +
  scale_color_manual(name = "Fits",
                     breaks = c("Linear"),
                     values = c("blue"))

Which gives us this fit:

enter image description here

We can see the standard error (the grey-shaded area) is pretty large.

Because the OP suggested that this might be an exponential relationship, we'll now try adding a fit using an exponential. An exponential curve can be linearized by taking logs of both sides, and then doing a linear fit to the data, which would be very simple with ggplot. However, we have a problem; log(0) is -Inf, so we can't simply take the logs of both sides and do a linear fit. Instead, we have to use glm() to do the fit, and pass it through geom_smooth(). We can also take advantage of ggplot's ability to build up plots bit-by-bit, and just tack it on to the previous plot:

p.dp.2 <- p.dp.1 +
  geom_smooth(method = "glm",
              family = gaussian(link="log"), 
              aes(colour = "Exponential")) +
  scale_color_manual(name = "Fits",
                     breaks = c("Linear","Exponential"),
                     values = c("red","blue"))

Which gives us this:

enter image description here

Where we can see that the standard error (the grey bound) is much narrower than the linear fit. So, we have some justification of using the exponential, although it would still be nice to have a chemical model of activity versus concentration to base this on.

We now need the concentration at activity = 50. We can plot this using geom_abline():

p.dp.3 <- p.dp.2 +
  geom_abline(intercept = 50, slope =0,
              aes(colour = "Activity = 50")) +
  scale_color_manual(name = "Fits",
                     breaks = c("Linear","Exponential","Activity = 50"),
                     values = c("black","red","blue"))

which gives us this:

enter image description here

Our best estimate for 50% activity is the intersection of the black line of activity = 50 and the best-fit line using the exponential model. We can see that this point is slightly to the right of the data point at Conc = 83, and to the left of the line of concentration = 100.

To get the actual intersect, we create the model for activity as a function of concentration outside of ggplot, and approximate the fit using approx(). The model is set up like this:

dp.model <- glm(formula = Activity ~ Conc,
                family=gaussian(link="log"),
                data = DP)

Following this example we can get the concentration when activity is 50:

conc.est <- approx(x = dp.model$fitted.values,
      y = dp.model$data$Conc, xout=50)$y

and then the concentration is...

> conc.est
89.29575

So, using an exponential model for activity as a function of concentration, we can estimate that concentration = 89.3 when activity = 50. We can see from the figures that the standard error associated with this estimate is huge, though. Unfortunately I can't figure out how to get the error bands for this estimate.

Related Question