Solved – How to un-transform exponential plot data to get back to original data scale

data transformation, exponential distribution, r

Mock data / RepEx

err <- 0.5*rnorm(101)
x <- seq(from=500, to=1000, by = 5)
y <- exp(.005*x) + err
mydata <- data.frame(x,y)

My raw data contains large numbers, but exp(large number) exceeds the computer's capacity. Yet, this relationship should, biologically, be logarithmic.

So, I simply divided my values by 1000 and all is well:

library(ggplot2)
library(broom)

myfit <- lm(y ~ exp(x/1000), data = mydata)
ggplot(mydata, aes(x,y))+geom_point()

I can get my fitted values:

myfitted <- augment(myfit, data = mydata )

But I want to visualize how well my fitted values fit my actual data. However, my fitted values are no longer on the same scale as my original data. I'm trying to work out how to adjust them back to the original scale. Ideally, this line should be pretty close to the identity line (y = x).

I've tried the following:

ggplot(myfitted, aes(x=x, y=1000*log(.fitted))) + 
geom_point(color = "blue") + geom_point(aes(y=x), color="red") +
coord_fixed()

ggplot(myfitted, aes(x=x, y=log(.fitted*1000))) + 
geom_point(color = "blue") + geom_point(aes(y=x), color="red") + 
coord_fixed()

And a few other variations on that theme. It's been a while since I worked with algebra – what can I do to reverse my transformation and view my data back at its original scale?

Best Answer

You are going about this modelling exercise from the wrong direction. You are transforming x, which, among other things, causes trouble with large values. Instead, you could transform y.

Anyway, your attempts fail because you are applying a back-transformation to the fitted values, when what you transformed before fitting the model was the x variable, not y.

In this instance you don't need to do anything to the fitted values. And if we plot y against exp(x/1000), you'll also see that the transformation failed to do anything of interest:
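To see that the fitted values are already on the scale of y, one option is to plot observed against fitted values with the identity line added. This is only a sketch: the fixed seed, the `fitted` column name, and the plot object `p` are choices made here for illustration, not part of the original post.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
err <- 0.5 * rnorm(101)
x <- seq(from = 500, to = 1000, by = 5)
y <- exp(0.005 * x) + err
mydata <- data.frame(x, y)

# Same model as in the question: only x is transformed
myfit <- lm(y ~ exp(x / 1000), data = mydata)

# fitted() returns values in the units of the response, here y
mydata$fitted <- fitted(myfit)
range(mydata$y)
range(mydata$fitted)

# Observed vs fitted, with the identity line (y = x) in red
p <- ggplot(mydata, aes(x = fitted, y = y)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, colour = "red")
p
```

If the model fitted well, the points would scatter evenly around the red line; systematic curvature away from it suggests the model form is wrong rather than the scale.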

library(ggplot2)  # loaded in the question; repeated so this runs on its own

err <- 0.5*rnorm(101)
x <- seq(from=500, to=1000, by = 5)
y <- exp(.005*x) + err
mydata <- data.frame(x,y, expx = exp(x / 1000))
theme_set(theme_bw())

ggplot(mydata, aes(x = expx, y = y)) + geom_point()

So all your transformation achieved is a rescaling of x; the relationship wasn't linearised at all. If you proceed, you'll just fit a straight line to the non-linear relationship. Let's do that anyway, because it shows that you don't need to fiddle with x at all if you fit the model as you did, since predict() applies exp(x/1000) from the formula for you:
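A quick numerical check of the "just a rescaling" claim: over the range of x used here, exp(x/1000) is very nearly a linear function of x. The names `expx` and `r` below are only illustrative.

```r
x <- seq(from = 500, to = 1000, by = 5)
expx <- exp(x / 1000)

# x/1000 only spans 0.5 to 1, so exp() barely curves over this interval
range(expx)  # roughly 1.65 to 2.72

# Correlation between x and its "transformed" version
r <- cor(x, expx)
r
```

A correlation this close to 1 means that regressing y on exp(x/1000) is nearly equivalent to regressing y on x itself.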

myfit1 <- lm(y ~ exp(x/1000), data = mydata)

newd <- data.frame(x = seq(500, 1000, by = 1))
newd <- transform(newd, Fitted = predict(myfit1, newd),
                        expx   = exp(x / 1000))

ggplot() +
  geom_point(aes(x = x, y = y), mydata) +
  geom_line(aes(x = x, y = Fitted), newd, size = 1)

The plot looks essentially the same, apart from the labelling on the x-axis, if we plot on the exp(x/1000) scale:

ggplot() +
  geom_point(aes(x = expx, y = y), mydata) +
  geom_line(aes(x = expx, y = Fitted), newd, size = 1)

What you can do instead is transform y to linearise the relationship:

myfit2 <- lm(log(y) ~ x, data = mydata)

newd <- transform(newd, Fitted2 = exp(predict(myfit2, newd)))

ggplot() +
  geom_point(aes(x = x, y = y), mydata) +
  geom_line(aes(x = x, y = Fitted), newd, size = 1) +
  geom_line(aes(x = x, y = Fitted2), newd, size = 1, colour = "red")

The red line, from the model fitted to log(y) and back-transformed with exp(), now does a much better job of fitting the data.
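The improvement can also be put into numbers by comparing the two fits on the original scale of y. The sketch below assumes a fixed seed and uses a small helper, `rmse()`, defined here for illustration; neither is from the original post.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
err <- 0.5 * rnorm(101)
x <- seq(from = 500, to = 1000, by = 5)
y <- exp(0.005 * x) + err
mydata <- data.frame(x, y)

myfit1 <- lm(y ~ exp(x / 1000), data = mydata)  # x transformed
myfit2 <- lm(log(y) ~ x, data = mydata)         # y transformed

# Hypothetical helper: root mean squared error on the scale of y
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

rmse_exp <- rmse(mydata$y, fitted(myfit1))       # no back-transformation needed
rmse_log <- rmse(mydata$y, exp(fitted(myfit2)))  # undo the log with exp()
c(exp_x = rmse_exp, log_y = rmse_log)
```

Note that only the second model needs `exp()` applied to its fitted values, because only there was the response transformed.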

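The basic point here is that fitted values are always on the scale of the response used in the model formula. If you transform only x, they are already on the scale of y and need no back-transformation; if you transform y, as in myfit2, you back-transform the predictions with the inverse function (here exp()).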
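Written out, with coefficient symbols chosen here purely for illustration, the model with transformed x is

$$y = \beta_0 + \beta_1 \exp(x/1000) + \varepsilon \quad\Rightarrow\quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \exp(x/1000)$$

so the fitted values are in the units of y directly. The model with transformed y is

$$\log(y) = \alpha_0 + \alpha_1 x + \varepsilon \quad\Rightarrow\quad \hat{y} = \exp(\hat{\alpha}_0 + \hat{\alpha}_1 x)$$

and it is only here that an inverse transformation of the predictions is needed.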

Finally, Mosteller and Tukey's bulging rule suggests that, for a relationship shaped like the one in your data, you could either transform y via a square root or log transform, or transform x by, say, squaring or cubing it. By that rule of thumb, exp(x/1000) was not a useful choice of transformation. In this case, we can roughly linearise the relationship by applying the following transformation to x:

$$x^{\prime} = (x/1000)^5$$

(the division by 1000 is there just to avoid very large values of x). The regression fit of y on this transformed x is shown below, plotted against the original x:

myfit3 <- lm(y ~ I((x/1000)^5), data = mydata)
newd <- transform(newd, Fitted3 = predict(myfit3, newd))

ggplot() +
  geom_point(aes(x = x, y = y), mydata) +
  geom_line(aes(x = x, y = Fitted3), newd, size = 1, col = "red")
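One rough way to compare how well each candidate transformation of x linearises the relationship is to look at its correlation with y. This sketch assumes the same simulated data with a fixed seed; the names `cor_exp` and `cor_pow5` are illustrative only.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
err <- 0.5 * rnorm(101)
x <- seq(from = 500, to = 1000, by = 5)
y <- exp(0.005 * x) + err

# Linear association between y and each transformed version of x
cor_exp  <- cor(y, exp(x / 1000))  # the transformation used in the question
cor_pow5 <- cor(y, (x / 1000)^5)   # the power transformation suggested above
c(exp = cor_exp, pow5 = cor_pow5)
```

The power transformation should give a correlation noticeably closer to 1, which is what "roughly linearise" means in practice.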

Which transformation you choose should, however, be informed by the system you are studying. The log transform of y works better here because that is how the data were generated.
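Because the mock data were generated as exp(.005*x) plus noise, the log-linear model should recover a slope near 0.005 and an intercept near 0. A minimal check, assuming a fixed seed (not in the original) and using `slope` as an illustrative name:

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
err <- 0.5 * rnorm(101)
x <- seq(from = 500, to = 1000, by = 5)
y <- exp(0.005 * x) + err

# log(y) is approximately 0 + 0.005 * x, since the noise is small relative to y
myfit2 <- lm(log(y) ~ x)
coef(myfit2)

slope <- unname(coef(myfit2)["x"])
slope
```

With real data the generating process is unknown, so this kind of check is only available in a simulation like this one.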
