Confidence Interval – Calculate for Mean of Log-Normal Data Sets

confidence interval, lognormal distribution, mean

I've heard/seen in several places that you can transform the data set into something that is normal-distributed by taking the logarithm of each sample, calculate the confidence interval for the transformed data, and transform the confidence interval back using the inverse operation (e.g. raise 10 to the power of the lower and upper bounds, respectively, for $\log_{10}$).

However, I'm a bit suspicious of this method, simply because it doesn't work for the mean itself: $10^{\operatorname{mean}(\log_{10}(X))} \ne \operatorname{mean}(X)$

What is the correct way to do this? If it doesn't work for the mean itself, how can it possibly work for the confidence interval for the mean?

Best Answer

There are several ways to calculate confidence intervals for the mean of a lognormal distribution. I am going to present two methods: bootstrap and profile likelihood. I will also present a Bayesian alternative based on the Jeffreys prior.
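The suspicion raised in the question can be checked numerically first. The sketch below is only an illustration on simulated data (the sample size and the names `geo_mean` and `arith_mean` are my own choices, and it uses natural logarithms rather than $\log_{10}$): back-transforming the mean of the logs estimates $\exp(\mu)$, the median of the lognormal, whereas the mean is $\exp(\mu+\sigma^2/2)$.

```r
set.seed(1)

# Simulated lognormal data with mu = 0 and sigma = 1 (assumed values)
mu <- 0
sigma <- 1
x <- exp(rnorm(1e5, mean = mu, sd = sigma))

# Back-transformed mean of the logs: estimates exp(mu), the median
geo_mean <- exp(mean(log(x)))

# Ordinary sample mean: estimates exp(mu + sigma^2/2), the mean
arith_mean <- mean(x)

c(geometric = geo_mean, arithmetic = arith_mean,
  true_median = exp(mu), true_mean = exp(mu + sigma^2 / 2))
```

This is why simply back-transforming a confidence interval for $\mu$ gives an interval for the median (geometric mean) rather than for the mean.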

Bootstrap

For the MLE

In this case, the MLEs of $(\mu,\sigma^2)$ for a sample $(x_1,\dots,x_n)$ are

$$\hat\mu= \dfrac{1}{n}\sum_{j=1}^n\log(x_j);\,\,\,\hat\sigma^2=\dfrac{1}{n}\sum_{j=1}^n(\log(x_j)-\hat\mu)^2.$$

Since the mean of a lognormal distribution is $\delta=\exp(\mu+\sigma^2/2)$, the MLE of the mean is $\hat\delta=\exp(\hat\mu+\hat\sigma^2/2)$. By resampling we can obtain a bootstrap sample of $\hat\delta$ and use it to calculate several bootstrap confidence intervals. The following R code shows how to obtain these.

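library(boot)

set.seed(1)

# Simulated data: lognormal with mu = 0, sigma = 1
data0 = exp(rnorm(100))

# Statistic: MLE of the lognormal mean, exp(mu + sigma^2/2)
mle = function(dat){
  m = mean(log(dat))           # MLE of mu
  s = mean((log(dat)-m)^2)     # MLE of sigma^2 (divides by n)
  return(exp(m+s/2))
}

# Bootstrap: boot() passes the data and a vector of resampled indices
boots.out = boot(data=data0, statistic=function(d, ind){mle(d[ind])}, R = 10000)
plot(density(boots.out$t))

# 4 types of Bootstrap confidence intervals
# (normal, basic, percentile and BCa; the studentized interval needs
# bootstrap variance estimates, which this statistic does not return)
boot.ci(boots.out, conf = 0.95, type = c("norm", "basic", "perc", "bca"))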
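For readers who want to see what the resampling step amounts to, here is a minimal base-R sketch of the percentile interval only, written without the `boot` package. The names `mle_mean`, `B` and `boot_est` are illustrative, and the number of replicates is an arbitrary choice, so the result will not match the `boot.ci` output exactly.

```r
set.seed(1)

# Same kind of simulated data as above
x <- exp(rnorm(100))

# MLE of the lognormal mean, exp(mu + sigma^2/2)
mle_mean <- function(dat) {
  m <- mean(log(dat))
  s2 <- mean((log(dat) - m)^2)
  exp(m + s2 / 2)
}

# Resample the data with replacement and recompute the estimator each time
B <- 2000
boot_est <- replicate(B, mle_mean(sample(x, replace = TRUE)))

# Percentile interval: empirical 2.5% and 97.5% quantiles of the replicates
ci_perc <- quantile(boot_est, c(0.025, 0.975))
ci_perc
```

The other interval types (normal, basic, BCa) apply different corrections to the same set of replicates, which is what `boot.ci` takes care of.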

For the sample mean

Now consider the estimator $\tilde{\delta}=\bar{x}$ instead of the MLE. Other types of estimators might be considered as well.

library(boot)

set.seed(1)

# Simulated data: lognormal with mu = 0, sigma = 1
data0 = exp(rnorm(100))

# Statistic (sample mean)
samp.mean = function(dat) return(mean(dat))

# Bootstrap: boot() passes the data and a vector of resampled indices
boots.out = boot(data=data0, statistic=function(d, ind){samp.mean(d[ind])}, R = 10000)
plot(density(boots.out$t))

# 4 types of Bootstrap confidence intervals
# (normal, basic, percentile and BCa)
boot.ci(boots.out, conf = 0.95, type = c("norm", "basic", "perc", "bca"))

Profile likelihood

Using the invariance property of the likelihood, we can reparameterise as $(\mu,\sigma)\rightarrow(\delta,\sigma)$, where $\delta=\exp(\mu+\sigma^2/2)$, and then numerically calculate the profile likelihood of $\delta$:

$$R_p(\delta)=\dfrac{\sup_{\sigma}{\mathcal L}(\delta,\sigma)}{\sup_{\delta,\sigma}{\mathcal L}(\delta,\sigma)}.$$

This function takes values in $(0,1]$; the set of values of $\delta$ with $R_p(\delta)\geq 0.147$ is an interval with approximate confidence $95\%$. We are going to use this property to construct a confidence interval for $\delta$. The following R code shows how to obtain this interval.

set.seed(1)

# Simulated data
data0 = exp(rnorm(100))

# Log likelihood
ll = function(mu,sigma) return( sum(dlnorm(data0,mu,sigma,log=TRUE)) )

# Maximised log likelihood, evaluated at the MLE of (mu, sigma)
mu.hat = mean(log(data0))
sigma.hat = sqrt(mean((log(data0)-mu.hat)^2))
ll.max = ll(mu.hat,sigma.hat)

# Profile likelihood of delta: maximise over sigma with mu = log(delta) - sigma^2/2
Rp = function(delta){
  temp = function(sigma) return( ll(log(delta)-0.5*sigma^2,sigma) )
  return( exp(optimize(temp,c(0.25,1.5),maximum=TRUE)$objective - ll.max) )
}

vec = seq(1.2,2.5,0.001)
rvec = sapply(vec,Rp)
plot(vec,rvec,type="l")

# Profile confidence interval: the two roots of Rp(delta) = 0.147
tr = function(delta) return(Rp(delta)-0.147)
c(uniroot(tr,c(1.2,1.6))$root,uniroot(tr,c(2,2.3))$root)
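The level $0.147$ is not arbitrary. Under the usual large-sample approximation, $-2\log R_p(\delta)$ is approximately $\chi^2_1$ distributed, so the $95\%$ region is $R_p(\delta)\geq\exp(-q_{0.95}/2)$, where $q_{0.95}$ is the $0.95$ quantile of the $\chi^2_1$ distribution. A short check (the variable names are illustrative):

```r
# 0.95 quantile of the chi-square distribution with 1 degree of freedom
q95 <- qchisq(0.95, df = 1)

# Relative-likelihood cutoff for an approximate 95% interval (about 0.1465)
cutoff <- exp(-q95 / 2)
cutoff
```

Other confidence levels follow by changing the quantile, for example `qchisq(0.99, df = 1)` for an approximate $99\%$ interval.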

$\star$ Bayesian

In this section, an alternative algorithm, based on Metropolis-Hastings sampling and the use of the Jeffreys prior, for calculating a credibility interval for $\delta$ is presented.

Recall that the Jeffreys prior for $(\mu,\sigma)$ in a lognormal model is

$$\pi(\mu,\sigma)\propto \sigma^{-2},$$

and that this prior is invariant under reparameterisations. This prior is improper, but the posterior of the parameters is proper if the sample size $n\geq 2$. The following R code shows how to obtain a 95% credibility interval using this Bayesian model.

library(mcmc)

set.seed(1)

# Simulated data
data0 = exp(rnorm(100))

# Log posterior of (mu, sigma): log likelihood plus log of the Jeffreys prior sigma^(-2)
lp = function(par){
  if(par[2]>0) return( sum(dlnorm(data0,par[1],par[2],log=TRUE)) - 2*log(par[2]))
  else return(-Inf)
}

# Metropolis-Hastings
NMH = 260000
out = metrop(lp, scale = 0.175, initial = c(0.1,0.8), nbatch = NMH)

# Acceptance rate
out$accept

# Start at iteration 10000 (burn-in) and keep every 25th draw (thinning)
ind = seq(10000,NMH,25)
deltap = exp( out$batch[ind,1] + 0.5*out$batch[ind,2]^2 )

plot(density(deltap))

# 95% credibility interval
c(quantile(deltap,0.025),quantile(deltap,0.975))

Note that the intervals obtained with the different methods are very similar.
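As a cross-check on the Metropolis-Hastings output, the same posterior can also be sampled directly. This is a different technique from the MCMC code above, and it rests on my own derivation: with $y_j=\log x_j$, $S=\sum_j (y_j-\bar y)^2$ and the prior $\pi(\mu,\sigma)\propto\sigma^{-2}$, the posterior satisfies $S/\sigma^2\mid y\sim\chi^2_n$ and $\mu\mid\sigma^2,y\sim N(\bar y,\sigma^2/n)$. The names `ndraw`, `sigma2` and `ci_direct` are illustrative.

```r
set.seed(1)

# Simulated data, as in the examples above
data0 <- exp(rnorm(100))
y <- log(data0)
n <- length(y)
S <- sum((y - mean(y))^2)

# Direct Monte Carlo draws from the posterior under the prior sigma^(-2)
ndraw <- 10000
sigma2 <- S / rchisq(ndraw, df = n)                         # sigma^2 | data
mu <- rnorm(ndraw, mean = mean(y), sd = sqrt(sigma2 / n))   # mu | sigma^2, data

# Posterior draws of the lognormal mean and a 95% equal-tailed interval
delta <- exp(mu + sigma2 / 2)
ci_direct <- quantile(delta, c(0.025, 0.975))
ci_direct
```

If the derivation holds, this interval should agree with the Metropolis-Hastings one up to Monte Carlo error, without any tuning of the proposal scale, burn-in or thinning.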

Related Solutions

Solved – Prediction interval for a fitted log-normal distribution

Here is one simple approach:

> x.logmod = lm(log(x) ~ 1)
> exp(predict(x.logmod, newdata = data.frame(junk = 0), interval = "predict"))
       fit       lwr      upr
1 1.094619 0.1773106 6.757576

The linear model obtains the mean of $\log(x)$. The predict statement can compute a prediction interval for a new dataset, so if we un-transform it, we get a prediction interval for $x$ itself. The newdata argument may be skipped if you want 100 copies of the same interval! Instead, I provided a dataset that has just one row; since we are predicting the intercept, it doesn't matter what's in it.
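The same interval can be written out by hand, which makes explicit what `predict` is doing for an intercept-only model. This is a sketch on simulated data, since the original `x` is not shown; `half` and `pred_int` are illustrative names. On the log scale the prediction interval is $\bar y \pm t_{0.975,\,n-1}\, s\sqrt{1+1/n}$, and exponentiating the endpoints gives the interval for a new observation of $x$.

```r
set.seed(1)

# Assumed example data; the x used in the answer is not available
x <- exp(rnorm(100))
y <- log(x)
n <- length(y)

# Half-width of the 95% prediction interval on the log scale
half <- qt(0.975, df = n - 1) * sd(y) * sqrt(1 + 1 / n)

# Back-transform the centre and endpoints to the original scale
pred_int <- exp(c(fit = mean(y), lwr = mean(y) - half, upr = mean(y) + half))
pred_int
```

Note that the back-transformed `fit` is the geometric mean of `x`, not its arithmetic mean; the interval itself is valid for a new observation because quantiles are preserved under the monotone transformation.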

Solved – reliable nonparametric confidence interval for the mean of a skewed distribution

I am somewhat pessimistic about such a non-parametric method, at least without the introduction of some sort of constraints on the underlying distribution.

My reasoning for this is that there will always be a distribution that breaks the true coverage probability for any finite $n$ (although as $n \rightarrow \infty$, this distribution will become more and more pathological), or the confidence interval will have to be arbitrarily large.

To illustrate, you could imagine a distribution that looks like a normal up to some value $\alpha$, but after $\alpha$ becomes extremely right skewed. This tail can have unbounded influence on the distribution's mean, and as you push $\alpha$ out further, it has arbitrarily small probability of making it into your sample. So for any $n$, you could pick $\alpha$ so large that, with extremely high probability, all points in your sample look like they come from a normal distribution with mean = 0, sd = 1, while the true mean can be anything.

So if you're looking for proper asymptotic coverage, of course this can be achieved by the CLT. However, your question implies that you are (quite reasonably) interested in the finite-sample coverage. As my example shows, there will always be a pathological case that ruins any finite-length CI.
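A small simulation makes the argument concrete. The distribution below is a hypothetical stand-in for the one described above: each observation is standard normal with probability $1-p$ and equal to a huge value $M$ with probability $p$. The values of `n`, `p`, `M` and the number of replications are arbitrary choices for illustration, and the interval checked is the ordinary t-interval.

```r
set.seed(1)

n <- 30        # sample size
p <- 0.001     # probability of the extreme value
M <- 1e4       # the extreme value
true_mean <- p * M   # mean of the mixture: (1 - p) * 0 + p * M = 10

# For each replication, draw a sample and check whether the 95% t-interval
# contains the true mean
covers <- replicate(2000, {
  x <- ifelse(runif(n) < p, M, rnorm(n))
  ci <- t.test(x)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Empirical coverage of the nominal 95% interval
coverage <- mean(covers)
coverage
```

Most samples contain no extreme value at all, so they look like clean standard normal data and the interval sits near 0, far from the true mean of 10; the empirical coverage is therefore far below the nominal $95\%$. Making $p$ smaller and $M$ larger reproduces the same failure for any larger $n$.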

Now, you still could have a non-parametric CI that achieves good finite coverage by adding constraints to your distribution. For example, the log-concave constraint is a non-parametric constraint. However, it seems inadequate for your problem, as log-normal is not log-concave.

Perhaps to help illustrate how difficult your problem could be, I've done unpublished work on a different constraint: inverse convex (if you click on my profile, I have a link to a personal page that has a preprint). This constraint includes most, but not all log-normals. You can also see that for this constraint, the tails can be "arbitrarily heavy", i.e. for any inverse convex distribution up to some $\alpha$, you can have heavy enough tails that the mean will be as large as you like.
