Solved – reliable nonparametric confidence interval for the mean of a skewed distribution

bootstrapconfidence intervalmeanmediannonparametric

Very skewed distributions such as the log-normal do not result in accurate bootstrap confidence intervals. Here is an example showing that the left and right tail areas are far from the ideal 0.025 no matter which bootstrap method you try in R:

require(boot)
n    <- 25
B    <- 1000
nsim <- 1000
set.seed(1)
which <- c('basic', 'perc', 'norm', 'bca', 'stud')
mul <- 0; sdl <- 1.65   # on log scale
dist <- c('normal', 'lognormal')[2]
switch(dist, normal    = {g <- function(x) x; mu <- mul},
             lognormal = {g <- exp; mu <- exp(mul + sdl * sdl / 2)})
count <- matrix(0, nrow=length(which), ncol=2,
                dimnames=list(which, c('lower', 'upper')))
stat <- function(x, j) {
## See http://www.psychology.mcmaster.ca/bennett/boot09/percentileT.pdf
  x <- x[j]
  m <- mean(x)
  s <- sd(x)
  n <- length(x)
  sem <- s / sqrt(n)
  m.var <- sem ^ 2
  c(m, m.var)
}
for(i in 1 : nsim) {
  if(i %% 100 == 0) cat(i, '')
  x <- g(rnorm(n, mul, sdl))
  b  <- boot(x, stat, R=B)
  ci <- boot.ci(b, type=which)
  for(w in which) {
    nam <- switch(w, perc='percent', norm='normal', basic='basic',
                  stud='student', bca='bca')
    z <- rev(rev(ci[[nam]])[1:2])
    count[w, 'lower'] <- count[w, 'lower'] + (z[1] > mu)
    count[w, 'upper'] <- count[w, 'upper'] + (z[2] < mu)
  }
}
cat('\n')
count / nsim

The result is below:

      lower upper
basic 0.000 0.329
perc  0.003 0.257
norm  0.000 0.287
bca   0.015 0.185
stud  0.005 0.129

For $n=400$ single bootstraps still do not provide adequately accurate coverage:

      lower upper
basic 0.001 0.114
perc  0.005 0.093
norm  0.002 0.102
bca   0.017 0.067
stud  0.011 0.058

Empirical likelihood also fails to provide accurate confidence intervals when sampling from the lognormal distribution.

Is there a general-purpose approach out there that does not depend on knowing the distribution in advance? Has anyone tried to get confidence intervals for the mean by fitting the data to the Tukey generalized $\lambda$ distribution (this distribution is highly flexible)? What about using Kolmogorov-Smirnov confidence bands for the CDF? Would computing the mean on the upper and lower bounds on the CDF be horribly conservative? I would settle for some conservatism if a method has wide applicability.

To restate the goals, I'm seeking a generally applicable approach for getting a confidence interval for a population mean such that

  1. the interval is asymmetric if the raw data distribution is asymmetric
  2. the interval has correct coverage in both tails (e.g., 0.025 error probability in both)
  3. the procedure does not require the analyst to specify anything about the underlying distribution or the transformation needed to make the distribution symmetric

Note that the central limit theorem is irrelevant here; I have a fixed small sample size and the confidence interval must be asymmetric to be accurate in both tails. The parametric $t$-based confidence interval under a lognormal model with $\mu=0, \sigma=1.65$ and $n=20000$ still has bad coverage (left tail error 0.012, right 0.047 when both should be 0.025).

In continuing to think about this there are two broad ways of conceptualizing the problem that I'd like to discuss.

  1. The mean is not a quantity that lends itself to nonparametric inference, at least when exactness of inference is required. The sample median is meaningful for any continuous distribution and we have a simple exact confidence interval for the median. In a sample of size $n=20$ from a normal distribution the confidence interval for the median is $1.28 \times$ longer than the exact $t$-based confidence interval for the mean (see code below). Perhaps this factor of 1.28 is a reasonable price to pay for robustness and complete distributional freedom.
  2. Even though no single bootstrap will give adequately accurate confidence limits for samples from extremely skewed distributions, the double bootstrap can significantly improve the confidence coverage in both tails. Nankervis has some nice results and provides an excellent computational algorithm. But no software I could find implements this.

R code illustrating 1. above:

## Exact CI for median from DescTools package SignTest.default
## See also ttp://www.stat.umn.edu/geyer/old03/5102/notes/rank.pdf,
## http://de.scribd.com/doc/75941305/Confidence-Interval-for-Median-Based-on-Sign-Test
cimed <- function(x, alpha=0.05, na.rm=FALSE) {
  if(na.rm) x <- x[! is.na(x)]
  n <- length(x)
  k <- qbinom(p=alpha / 2, size=n, prob=0.5, lower.tail=TRUE)
  ## Actual CL: 1 - 2 * pbinom(k - 1, size=n, prob=0.5) >= 1 - alpha
  sort(x)[c(k, n - k + 1)]
}

n <- 20
m <- 20000
cil <- cilt <- 0
z <- qt(0.975, n - 1)

for(i in 1 : m) {
  x <- rnorm(n)
  cil  <- cil + diff(cimed(x))
  cilt <- cilt + 2 * z * sqrt(var(x) / n)
}
cil  <- cil / m
cilt <- cilt / m

c(cil, cilt, cilt / cil, cil / cilt)

Best Answer

I am somewhat pessimistic about a such non-parametric method, at least without the introduction of some sort of constraints on the underlying distribution.

My reasoning for this is that there will always be a distribution that breaks the true coverage probability for any finite $n$ (although as $n \rightarrow \infty$, this distribution will become more and more pathological), or the confidence interval will have to be arbitrarily large.

To illustrate, you could imagine a distribution that looks like a normal up to some value $\alpha$, but after $\alpha$ becomes extremely right skewed. This can have unbounded influence on the distribution's mean and as you push $\alpha$ out as far as possible, this can have arbitrarily small probability of making it into your sample. So you can imagine that for any $n$, you could pick an $\alpha$ to be so large that all points in your sample have extremely high probability of looking like it comes from a normal distribution with mean = 0, sd = 1, but you can also have any true mean.

So if you're looking for proper asymptotic coverage, of course this can be achieved by the CLT. However, your question implies that you are (quite reasonably) interested in the finite coverage. As my example shows, there will always be a pathological case that ruins any finite length CI.

Now, you still could have a non-parametric CI that achieves good finite coverage by adding constraints to your distribution. For example, the log-concave constraint is a non-parametric constraint. However, it seems inadequate for your problem, as log-normal is not log-concave.

Perhaps to help illustrate how difficult your problem could be, I've done unpublished work on a different constraint: inverse convex (if you click on my profile, I have a link to a personal page that has a preprint). This constraint includes most, but not all log-normals. You can also see that for this constraint, the tails can be "arbitrarily heavy", i.e. for any inverse convex distribution up to some $\alpha$, you can have heavy enough tails that the mean will be as large as you like.

Related Question