Descriptive Statistics – Calculating Standard Error After Log-Transform

confidence interval, data transformation, descriptive statistics

Consider a random set of numbers that are normally distributed:

x <- rnorm(n=1000, mean=10)

We'd like to know the mean and the standard error on the mean so we do the following:

se <- function(x) { sd(x)/sqrt(length(x)) }
mean(x) # something near 10.0 units
se(x)   # something near 0.03 units

Great!

However, let's assume we don't necessarily know that our original distribution follows a normal distribution. We log-transform the data and perform the same standard error calculation.

z <- log(x, base=10)
mean(z) # something near 1 log units
se(z)   # something near 0.001 log units

Cool, but now we need to back-transform to get our answer in units NOT log units.

10^mean(z) # something near 10.0 units
10^se(z)   # something near 1.00 units

My question: Why, for a normal distribution, does the standard error differ depending on whether it was calculated from the distribution itself or if it was transformed, calculated, and back-transformed? Note: the means came out the same regardless of the transformation.

EDIT #1: Ultimately, I am interested in calculating a mean and confidence intervals for non-normally distributed data, so if you can give some guidance on how to calculate 95% CI's on transformed data including how to back-transform to their native units, I would appreciate it!

END EDIT #1

EDIT #2: I tried using the quantile function to get the 95% confidence intervals:

quantile(x, probs = c(0.05, 0.95))     # around [8.3, 11.6]
10^quantile(z, probs = c(0.05, 0.95))  # around [8.3, 11.6]

So, that converged on the same answer, which is good. However, using this method doesn't provide the exact same interval using non-normal data with "small" sample sizes:

t <- rlnorm(10)
mean(t)                            # around 1.46 units
10^mean(log(t, base=10))           # around 0.92 units
quantile(t, probs = c(0.05, 0.95))                     # around [0.211, 4.79]
10^(quantile(log(t, base=10), probs = c(0.05, 0.95)))  # around [0.209, 4.28]

Which method would be considered "more correct"? I assume one would pick the most conservative estimate?

As an example, would you report this result for the non-normal data (t) as having a mean of 0.92 units with a 95% confidence interval of [0.211, 4.79]?

END EDIT #2

Thanks for your time!

Best Answer

Your main problem with the initial calculation is there's no good reason why $e^{\text{sd}(\log(Y))}$ should be like $\text{sd}(Y)$. It's generally quite different.

In some situations, you can compute a rough approximation of $\text{sd}(Y)$ from $\text{sd}(\log(Y))$ via Taylor expansion.

$$\text{Var}(g(X))\approx \left(g'(\mu_X)\right)^2\sigma^2_X\,.$$

If we consider $X$ to be the random variable on the log scale, then here $g(X)=\exp(X)$.

If $\text{Var}(\exp(X))\approx \exp(\mu_X)^2\sigma_X^2$

then $\text{sd}(\exp(X))\approx \exp(\mu_X)\sigma_X$

These notions carry across to sampling distributions.

This tends to work reasonably well if the standard deviation is really small compared to the mean, as in your example.

> mean(y)
[1] 10
> sd(y)
[1] 0.03
> lm=mean(log(y))
> ls=sd(log(y))
> exp(lm)*ls
[1] 0.0300104
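The same approximation can be applied to the base-10 example from the question. This is only a minimal sketch: the seed and the names `se_direct` and `se_delta` are my own choices, not part of the original post. Since $g(z)=10^z$ has derivative $\ln(10)\cdot 10^z$, the back-transformed standard error is roughly $10^{\bar z}\cdot\ln(10)\cdot\text{se}(z)$, not $10^{\text{se}(z)}$.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(n = 1000, mean = 10)  # same setup as in the question
se <- function(v) sd(v) / sqrt(length(v))

z <- log10(x)                    # log-transformed data

# Standard error computed directly on the original scale
se_direct <- se(x)

# Delta-method approximation from the log10 scale:
# g(z) = 10^z, so g'(z) = log(10) * 10^z
se_delta <- 10^mean(z) * log(10) * se(z)

c(direct = se_direct, delta = se_delta)
```

The two numbers should agree closely here because the standard deviation is small relative to the mean; with more skewed data the approximation gets worse.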

If you want to transform a CI for a parameter, that works by transforming the endpoints.

If you're trying to transform back to obtain a point estimate and interval for the mean on the original (unlogged) scale, you will also want to unbias the estimate of the mean (see the above link): $E(\exp(X))\approx \exp(\mu_X)\cdot (1+\sigma_X^2/2)$, so a (very) rough large-sample interval for the mean might be $(c\cdot\exp(L),c\cdot\exp(U))$, where $L,U$ are the lower and upper limits of a log-scale interval, and $c$ is some consistent estimate of $1+\sigma_X^2/2$.
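A rough sketch of that recipe, under assumptions of my own (simulated lognormal data, a $t$-interval on the log scale, and the sample variance of the logs plugged in for $\sigma_X^2$). The names `ci_log`, `ci_geo`, `c_hat` and `ci_mean` are purely illustrative.

```r
set.seed(42)                                   # arbitrary seed
y <- rlnorm(200, meanlog = 1, sdlog = 0.3)     # hypothetical positive data
x <- log(y)                                    # work on the log scale
n <- length(x)

# 95% t-interval for the mean of the logs: (L, U)
ci_log <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)

# Transforming the endpoints gives an interval for exp(mu_X),
# i.e. the geometric mean, not the arithmetic mean
ci_geo <- exp(ci_log)

# Rough bias-correction factor c = 1 + sigma_X^2 / 2
c_hat <- 1 + var(x) / 2

# Very rough large-sample interval for the mean on the original scale
ci_mean <- c_hat * ci_geo

rbind(geometric = ci_geo, mean = ci_mean)
```

Note that this ignores the sampling variability of `c_hat`, which is one reason the interval is only rough.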

If your data are approximately normal on the log scale, you may want to treat it as a problem of producing an interval for a lognormal mean.

(There are other approaches to unbiasing mean estimates across transformations; e.g. see Duan, N., 1983. Smearing estimate: a nonparametric retransformation method. JASA, 78, 605-610)
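For what it's worth, here is a minimal sketch of how a smearing-type correction is commonly applied after a regression on the log scale. The simulated data and the names `smear`, `naive` and `smeared` are hypothetical. The idea is to multiply the naively back-transformed prediction by the average of the exponentiated residuals.

```r
set.seed(1)                                   # arbitrary seed
n <- 200
x <- runif(n, 0, 2)                           # hypothetical covariate
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.4))    # hypothetical positive response

fit <- lm(log(y) ~ x)                         # model fitted on the log scale

# Smearing factor: mean of the exponentiated log-scale residuals
smear <- mean(exp(residuals(fit)))

newx <- data.frame(x = c(0.5, 1.5))
naive <- exp(predict(fit, newdata = newx))    # targets the median, biased low for the mean
smeared <- smear * naive                      # smearing estimate of the mean

cbind(newx, naive, smeared)
```

Because the residuals average to zero, the factor is always at least 1, so the smeared predictions sit above the naive ones.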

Related Solutions

Confidence Interval – Calculate for Mean of Log-Normal Data Sets

There are several ways of calculating confidence intervals for the mean of a lognormal distribution. I am going to present two methods: bootstrap and profile likelihood. I will also present a Bayesian alternative based on the Jeffreys prior.

Bootstrap

For the MLE

In this case, the MLE of $(\mu,\sigma)$ for a sample $(x_1,...,x_n)$ are

$$\hat\mu= \dfrac{1}{n}\sum_{j=1}^n\log(x_j);\,\,\,\hat\sigma^2=\dfrac{1}{n}\sum_{j=1}^n(\log(x_j)-\hat\mu)^2.$$

Then, the MLE of the mean is $\hat\delta=\exp(\hat\mu+\hat\sigma^2/2)$. By resampling we can obtain a bootstrap sample of $\hat\delta$ and, using this, we can calculate several bootstrap confidence intervals. The following R code shows how to obtain these.

rm(list=ls())
library(boot)

set.seed(1)

# Simulated data
data0 = exp(rnorm(100))

# Statistic (MLE)

mle = function(dat){
m = mean(log(dat))
s = mean((log(dat)-m)^2)
return(exp(m+s/2))
}

# Bootstrap
boots.out = boot(data=data0, statistic=function(d, ind){mle(d[ind])}, R = 10000)
plot(density(boots.out$t))

# 4 types of Bootstrap confidence intervals
boot.ci(boots.out, conf = 0.95, type = "all")

For the sample mean

Now consider the estimator $\tilde{\delta}=\bar{x}$ instead of the MLE. Other types of estimators might be considered as well.

rm(list=ls())
library(boot)

set.seed(1)

# Simulated data
data0 = exp(rnorm(100))

# Statistic (sample mean)

samp.mean = function(dat) return(mean(dat))

# Bootstrap
boots.out = boot(data=data0, statistic=function(d, ind){samp.mean(d[ind])}, R = 10000)
plot(density(boots.out$t))

# 4 types of Bootstrap confidence intervals
boot.ci(boots.out, conf = 0.95, type = "all")

Profile likelihood

For the definition of likelihood and profile likelihood functions, see. Using the invariance property of the likelihood we can reparameterise as follows $(\mu,\sigma)\rightarrow(\delta,\sigma)$, where $\delta=\exp(\mu+\sigma^2/2)$ and then calculate numerically the profile likelihood of $\delta$.

$$R_p(\delta)=\dfrac{\sup_{\sigma}{\mathcal L}(\delta,\sigma)}{\sup_{\delta,\sigma}{\mathcal L}(\delta,\sigma)}.$$

This function takes values in $(0,1]$; an interval of level $0.147$ has an approximate confidence of $95\%$. We are going to use this property for constructing a confidence interval for $\delta$. The following R code shows how to obtain this interval.

set.seed(1)

# Simulated data
data0 = exp(rnorm(100))

# Log likelihood
ll = function(mu,sigma) return( sum(log(dlnorm(data0,mu,sigma))))

# Profile likelihood
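Rp = function(delta){
# Log-likelihood in sigma, with mu fixed so that exp(mu + sigma^2/2) = delta
temp = function(sigma) return( sum(log(dlnorm(data0,log(delta)-0.5*sigma^2,sigma)) ))
# Unrestricted MLEs of (mu, sigma)
mu.hat = mean(log(data0))
sigma.hat = sqrt(mean((log(data0)-mu.hat)^2))
# Ratio of the profiled maximum to the overall maximum
ratio = exp(optimize(temp,c(0.25,1.5),maximum=TRUE)$objective - ll(mu.hat,sigma.hat))
return(ratio)
}

vec = seq(1.2,2.5,0.001)
rvec = sapply(vec,Rp)
plot(vec,rvec,type="l")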

# Profile confidence intervals
tr = function(delta) return(Rp(delta)-0.147)
c(uniroot(tr,c(1.2,1.6))$root,uniroot(tr,c(2,2.3))$root)
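As a side note on where the level $0.147$ comes from: under the usual chi-square approximation for the likelihood ratio statistic (an assumption that holds only asymptotically), $-2\log R_p(\delta)$ is approximately $\chi^2_1$, so the cutoff for a given confidence level can be sketched as follows. The names `level` and `cutoff` are mine.

```r
# Relative-likelihood cutoff for an approximate 95% interval,
# based on -2*log(R_p) being approximately chi-square with 1 df
level <- 0.95
cutoff <- exp(-qchisq(level, df = 1) / 2)
cutoff   # roughly 0.1465, which is quoted as 0.147
```

Changing `level` gives the cutoff for other confidence levels in the same way.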

$\star$ Bayesian

In this section, an alternative algorithm, based on Metropolis-Hastings sampling and the use of the Jeffreys prior, for calculating a credibility interval for $\delta$ is presented.

Recall that the Jeffreys prior for $(\mu,\sigma)$ in a lognormal model is

$$\pi(\mu,\sigma)\propto \sigma^{-2},$$

and that this prior is invariant under reparameterisations. This prior is improper, but the posterior of the parameters is proper if the sample size $n\geq 2$. The following R code shows how to obtain a 95% credibility interval using this Bayesian model.

library(mcmc)

set.seed(1)

# Simulated data
data0 = exp(rnorm(100))

# Log posterior
lp = function(par){
if(par[2]>0) return( sum(log(dlnorm(data0,par[1],par[2]))) - 2*log(par[2]))
else return(-Inf)
}

# Metropolis-Hastings
NMH = 260000
out = metrop(lp, scale = 0.175, initial = c(0.1,0.8), nbatch = NMH)

#Acceptance rate
out$acc

deltap = exp(  out$batch[,1][seq(10000,NMH,25)] + 0.5*(out$batch[,2][seq(10000,NMH,25)])^2  )

plot(density(deltap))

# 95% credibility interval
c(quantile(deltap,0.025),quantile(deltap,0.975))

Note that the intervals obtained with the three approaches are very similar.

Solved – How to add confidence intervals to predicted data when the response variable is log transformed

@ogrisel: bootstrap seems overkill here! rather:

preds <- predict.glm(biomass,type="response",se.fit=TRUE,newdata=nd) 
logci <- preds$fit+(preds$se.fit)%*%t(qnorm(c(0.025,0.5,0.975)))
ci <- exp(logci)
dimnames(ci)[[2]]<-c("lower95", "est", "upper95")

should work (if confidence intervals for predictions are actually what you want)
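In case it helps, here is a self-contained sketch of the same idea. The simulated data frame `d`, the model `fit` and the new data `nd` are made up for illustration, and I am assuming a Gaussian model fitted to the log of the response, so that the predictions come out on the log scale.

```r
set.seed(1)                                          # arbitrary seed
d <- data.frame(x = runif(50, 1, 10))                # hypothetical predictor
d$y <- exp(0.5 + 0.2 * d$x + rnorm(50, sd = 0.3))    # hypothetical positive response

fit <- glm(log(y) ~ x, data = d)                     # gaussian family, log-scale response
nd <- data.frame(x = c(2, 5, 8))                     # new data to predict at

# Predictions and standard errors on the log scale
preds <- predict(fit, newdata = nd, type = "response", se.fit = TRUE)

# Normal-theory limits on the log scale, then back-transform the endpoints
logci <- preds$fit + preds$se.fit %*% t(qnorm(c(0.025, 0.5, 0.975)))
ci <- exp(logci)
colnames(ci) <- c("lower95", "est", "upper95")
ci
```

Keep in mind that the back-transformed "est" column estimates the median of the response on the original scale rather than its mean.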

cheers
