Solved – Why does the bootstrap interval have terrible coverage?

Tags: bootstrap, diagnostic

I wanted to do a class demonstration where I compare a t-interval to a bootstrap interval and calculate the coverage probability of both. I wanted the data to come from a skewed distribution so I chose to generate the data as exp(rnorm(10, 0, 2)) + 1, a sample of size 10 from a shifted lognormal. I wrote a script to draw 1000 samples and, for each sample, calculate both a 95% t-interval and a 95% bootstrap percentile interval based on 1000 replicates.

When I run the script, both methods give very similar intervals and both have coverage probability of 50-60%. I was surprised because I thought the bootstrap interval would be better.

My question is, have I

  • made a mistake in the code?
  • made a mistake in calculating the intervals?
  • made a mistake by expecting the bootstrap interval to have better coverage properties?

Also, is there a way to construct a more reliable CI in this situation?

tCI.total <- 0
bootCI.total <- 0
m <- 10                  # sample size
n.sim <- 1000            # number of simulated samples
n.boot <- 1000           # bootstrap replicates per sample
true.mean <- exp(2) + 1  # E[exp(N(0, 2^2))] + 1 = exp(sigma^2/2) + 1

for (i in 1:n.sim) {
  samp <- exp(rnorm(m, 0, 2)) + 1
  tCI <- mean(samp) + c(-1, 1) * qt(0.975, df = m - 1) * sd(samp) / sqrt(m)

  boot.means <- rep(0, n.boot)
  for (j in 1:n.boot) boot.means[j] <- mean(sample(samp, m, replace = TRUE))
  bootCI <- quantile(boot.means, c(0.025, 0.975))  # percentile interval

  if (tCI[1] < true.mean & true.mean < tCI[2]) tCI.total <- tCI.total + 1
  if (bootCI[1] < true.mean & true.mean < bootCI[2]) bootCI.total <- bootCI.total + 1
}
tCI.total / n.sim     # estimate of t interval coverage probability
bootCI.total / n.sim  # estimate of bootstrap interval coverage probability

Best Answer

Bootstrap diagnostics and remedies by Canty, Davison, Hinkley & Ventura (2006) seems a logical point of departure. They discuss multiple ways the bootstrap can break down and, more importantly here, offer diagnostics and possible remedies:

  1. Outliers
  2. Incorrect resampling model
  3. Nonpivotality
  4. Inconsistency of the bootstrap method

I don't see a problem with 1, 2, or 4 in this situation. Let's look at 3. As @Ben Ogorek notes (although I agree with @Glen_b that the normality discussion may be a red herring), the validity of the bootstrap depends on the pivotality of the statistic we are interested in, i.e., on its distribution not depending, at least approximately, on unknown parameters.
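To make the pivotality question concrete, here is a small sketch of my own (not from Canty et al.; the helper names `std.mean` and `skew` are mine) comparing the shape of a t-type statistic under two values of sigma. If the statistic were pivotal, its distribution would not change with sigma:

```r
set.seed(3)

# Centred and scaled sample mean for the shifted lognormal; under pivotality
# its distribution would be free of sigma.
std.mean <- function(sigma, m = 10, n.sim = 5000) {
  mu <- exp(sigma^2 / 2) + 1                  # true mean of exp(N(0, sigma^2)) + 1
  replicate(n.sim, {
    samp <- exp(rnorm(m, 0, sigma)) + 1
    (mean(samp) - mu) / (sd(samp) / sqrt(m))  # t-type statistic
  })
}

# Crude moment-based skewness, enough for a shape comparison
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

s1 <- skew(std.mean(1))
s2 <- skew(std.mean(2))
c(sigma1 = s1, sigma2 = s2)  # both negative; markedly more extreme for sigma = 2
```

The skewness of the statistic changing with sigma indicates its distribution depends on the unknown parameter, so it is far from pivotal here.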

Section 4 in Canty et al. suggests resampling-within-resamples to get a measure of bias and variance for the parameter estimate within each bootstrap resample. Here is code to replicate the formulas from p. 15 of the article:

library(boot)
m <- 10 # sample size
n.boot <- 1000
inner.boot <- 1000

set.seed(1)
samp.mean <- bias <- vars <- rep(NA, n.boot)
for ( ii in 1:n.boot ) {
    samp <- exp(rnorm(m, 0, 2)) + 1
    samp.mean[ii] <- mean(samp)
    foo <- boot(samp, statistic = function(xx, index) mean(xx[index]), R = inner.boot)
    bias[ii] <- mean(foo$t[,1]) - foo$t0  # inner-bootstrap estimate of bias
    vars[ii] <- var(foo$t[,1])            # inner-bootstrap estimate of variance
}

opar <- par(mfrow=c(1,2))
    plot(samp.mean,bias,xlab="Sample means",ylab="Bias",
        main="Bias against sample means",pch=19,log="x")
    abline(h=0)
    plot(samp.mean,vars,xlab="Sample means",ylab="Variance",
        main="Variance against sample means",pch=19,log="xy")
par(opar)

[Figure: bootstrap diagnostics (bias and variance plotted against the sample means)]

Note the log scales; without logs, this is even more blatant. We see clearly how the variance of the bootstrap mean estimate increases with the mean of the sample it was computed from. To me this looks like enough of a smoking gun to blame nonpivotality for the low confidence interval coverage.

However, I'll happily admit that one could follow up in lots of ways. For instance, we could look at how the probability that a simulated sample's confidence interval covers the true mean depends on that sample's mean.
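A minimal sketch of that follow-up (the variable names `n.sim`, `covered`, etc. are my own), recording each simulated sample's mean together with whether its percentile interval covers the true mean:

```r
set.seed(2)
m <- 10; n.sim <- 1000; n.boot <- 1000
true.mean <- exp(2) + 1

samp.mean <- rep(NA, n.sim)
covered <- rep(NA, n.sim)
for (i in 1:n.sim) {
    samp <- exp(rnorm(m, 0, 2)) + 1
    boot.means <- replicate(n.boot, mean(sample(samp, m, replace = TRUE)))
    ci <- quantile(boot.means, c(0.025, 0.975))        # percentile interval
    samp.mean[i] <- mean(samp)
    covered[i] <- ci[1] < true.mean & true.mean < ci[2]
}

mean(covered)  # overall coverage, in the same low range as in the question

# Compare the sample means of covering vs. non-covering cases
boxplot(samp.mean ~ covered, log = "y",
        xlab = "Interval covers true mean", ylab = "Sample mean (log scale)")
```

One would expect the non-covering cases to be dominated by samples that missed the rare large observations and hence have small means, with intervals sitting entirely below the true mean.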

As for remedies, Canty et al. discuss transformations, and logarithms come to mind here (e.g., bootstrap and build confidence intervals not for the mean, but for the mean of the logged data), but I couldn't really make it work.
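For what it's worth, here is a sketch of the log-transformation idea (my own attempt, not code from the article). It also illustrates one reason it is hard to make work: back-transforming an interval for the mean of the logs yields an interval for the median of the original data, not for its mean.

```r
set.seed(1)
samp <- exp(rnorm(10, 0, 2)) + 1
log.samp <- log(samp - 1)  # undo the shift; back on the N(0, 2^2) scale

# Percentile bootstrap for the mean of the logged data
boot.log.means <- replicate(1000, mean(sample(log.samp, replace = TRUE)))
ci.log <- quantile(boot.log.means, c(0.025, 0.975))

# Back-transforming gives an interval around exp(mu) + 1 = 2, which is the
# *median* of exp(N(0, 2^2)) + 1, not its mean exp(2) + 1.
exp(ci.log) + 1
```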

Canty et al. go on to discuss how one can reduce the number of inner bootstraps and the remaining noise via importance sampling and smoothing, and how to add confidence bands to the pivot plots.

This might be a fun thesis project for a smart student. I'd appreciate any pointers to where I went wrong, as well as to any other literature. And I'll take the liberty of adding the diagnostic tag to this question.