Solved – Explanation of confidence interval from R function boot.ci

bootstrap, confidence interval, r

I used the boot function in R to draw 40 bootstrap resamples and then used boot.ci to get the "normal" confidence interval. My R code follows.
1. Define the statistic used in the boot function
library(boot)      # boot, boot.ci
library(survival)  # coxph, Surv
uni_boot <- function(data, indices, vari) {
  d <- data[indices, ]
  unifit <- coxph(as.formula(paste("Surv(time, status) ~", vari)), data = d)
  # return hazard ratio
  summary(unifit)$coef[2]
}
2. Bootstrap
r1 <- boot(data = data, statistic = uni_boot, R = 40, vari = variable)
r2 <- boot.ci(boot.out = r1, type = "norm")
Then I examined the following.
Result of the r1 object:
      original        bias    std. error
t1*   1.053145  0.03274176     0.1714448
Result of the r2 object:
Intervals :
Level      Normal
95%   ( 0.684,  1.356 )
Calculations and Intervals on Original Scale
The bootstrap sample mean, mean(r1$t), is 1.085887, which is the original value plus the bias. However, when I examined the formula used to calculate the normal bootstrap interval, I found that its limits are original - bias - V^(1/2) * Z(1 - alpha) and original - bias - V^(1/2) * Z(alpha), where V is the bootstrap variance.
I followed the formula and did the calculation:
1.053145 - 0.03274176 + 1.96 * 0.1714448 gives the upper bound, 1.356
1.053145 - 0.03274176 - 1.96 * 0.1714448 gives the lower bound, 0.684
I expected the bootstrap mean to sit in the middle of the bootstrap CI. However, the midpoint of the CI here turns out to be 1.020. Sometimes the CI from boot.ci doesn't even cover the bootstrap mean, mean(r1$t).
Am I misunderstanding something?

Best Answer

Yes, the right way to correct for bias with bootstrapping is to subtract the bias from the value obtained on the original sample. When you get stuck thinking about bootstrapping, remember the guiding principle:

The population is to the sample as the sample is to the bootstrap samples.

Your value from the original sample is 1.053; you perform bootstrap resamples from the original sample and find a mean value of 1.086. The mean from the bootstrap samples was thus 0.033 higher than the value from the original sample: a bias of +0.033.

Applying the above principle, you use that bias value to estimate that the value from the original sample is 0.033 higher than the population value. So the bias-corrected estimate of the population value is 1.02.
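
The same arithmetic can be checked on simulated data. The sketch below is only an illustration, not the asker's Cox model: it assumes the boot package is installed, and the exponential sample, the seed, and the statistic (a variance with divisor n, which is biased downward) are all made up for the example. It compares the bias-corrected estimate, original minus bias, with the midpoint of the "normal" interval returned by boot.ci.

```r
library(boot)

set.seed(1)
x <- rexp(30)  # made-up skewed sample, for illustration only

# hypothetical statistic: variance with divisor n (biased downward)
var_n <- function(data, indices) {
  d <- data[indices]
  mean((d - mean(d))^2)
}

b  <- boot(data = x, statistic = var_n, R = 2000)
ci <- boot.ci(b, type = "norm")

original  <- b$t0                   # statistic on the original sample
boot_mean <- mean(b$t)              # mean of the bootstrap replicates
bias      <- boot_mean - original   # what print(b) reports as "bias"

corrected <- original - bias        # bias-corrected estimate
midpoint  <- mean(ci$normal[2:3])   # centre of the normal interval

c(original = original, boot_mean = boot_mean,
  corrected = corrected, midpoint = midpoint)
```

The last two entries should agree: the normal interval is centred on the bias-corrected estimate, not on the bootstrap mean.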

The CIs based on bootstrapping are supposed to represent the CI for the original population, while the values from the bootstrapped samples themselves are doubly biased from the population values (once from the original sample versus the population, and then again from the bootstrapped samples versus the original sample). With bias, you thus shouldn't be alarmed that values obtained from the bootstrapped samples fall outside the final CI estimates for the population.

Issues of confidence intervals from bootstrapping can be even trickier; the "normal" confidence intervals assume symmetry about the (bias-corrected) mean, an assumption that might not best describe the situation in practice. With bias and asymmetry, some other frequently used bootstrapping-based estimates of confidence intervals can be misleading. I struggled with these issues extensively until I forced myself to apply the above guiding principle systematically.
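
To see how the interval types can disagree, here is a minimal sketch that requests the normal, basic and percentile intervals from a single boot object. It assumes the boot package is installed; the log-normal sample, the seed, and the statistic (a plain sample mean) are hypothetical choices made only for illustration.

```r
library(boot)

set.seed(2)
x <- rlnorm(40)  # made-up right-skewed sample, for illustration only

# hypothetical statistic: the mean of the resampled data
mean_stat <- function(data, indices) mean(data[indices])

b  <- boot(data = x, statistic = mean_stat, R = 2000)
ci <- boot.ci(b, type = c("norm", "basic", "perc"))

# each component stores the confidence level first and the two limits last
limits <- rbind(
  normal     = ci$normal[2:3],
  basic      = ci$basic[4:5],
  percentile = ci$percent[4:5]
)
colnames(limits) <- c("lower", "upper")
limits
```

The normal interval is symmetric about the bias-corrected estimate, the percentile interval follows the shape of the bootstrap distribution, and the basic interval reflects those percentiles around the original estimate, so with skewed replicates the three generally differ.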

Related Solutions

Solved – the meaning of a confidence interval taken from bootstrapped resamples

If the bootstrapping procedure and the formation of the confidence interval were performed correctly, it means the same as any other confidence interval. From a frequentist perspective, a 95% CI implies that if the entire study were repeated identically ad infinitum, 95% of such confidence intervals formed in this manner will include the true value. Of course, in your study, or in any given individual study, the confidence interval either will include the true value or not, but you won't know which. To understand these ideas further, it may help you to read my answer here: Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean?

Regarding your further questions, the 'true value' refers to the actual parameter of the relevant population. (Samples don't have parameters, they have statistics; e.g., the sample mean, $\bar x$, is a sample statistic, but the population mean, $\mu$, is a population parameter.) As to how we know this, in practice we don't. You are correct that we are relying on some assumptions; we always are. If those assumptions are correct, it can be proven that the properties hold. This was the point of Efron's work back in the late 1970s and early 1980s, but the math is difficult for most people to follow. For a somewhat mathematical explanation of the bootstrap, see @StasK's answer here: Explaining to laypeople why bootstrapping works. For a quick demonstration short of the math, consider the following simulation using R:

# a function to perform bootstrapping
boot.mean.sampling.distribution = function(raw.data, B=1000){
  # this function will take 1,000 (by default) bootstrap samples, calculate the
  # mean of each one, store it, & return the bootstrapped sampling distribution
  # of the mean (sorted in increasing order)

  boot.dist = vector(length=B)     # this will store the means
  N         = length(raw.data)     # this is the N from your data
  for(i in 1:B){
    boot.sample  = sample(x=raw.data, size=N, replace=TRUE)
    boot.dist[i] = mean(boot.sample)
  }
  boot.dist = sort(boot.dist)
  return(boot.dist)
}

# simulate bootstrapped CIs from a population w/ true mean = 0; on each pass
# through the loop, we will get a sample of data from the population, get the
# bootstrapped sampling distribution of the mean, & see if the population mean
# is included in the 95% confidence interval implied by that distribution

set.seed(00)                       # this makes the simulation reproducible
includes = vector(length=1000)     # this will store our results
for(i in 1:1000){
  sim.data    = rnorm(100, mean=0, sd=1)
  boot.dist   = boot.mean.sampling.distribution(raw.data=sim.data)
  includes[i] = boot.dist[25]<0 & 0<boot.dist[976]
}
mean(includes)     # this tells us the % of CIs that included the true mean
[1] 0.952

Solved – Bias-corrected percentile confidence intervals

You almost had it. Change your Step 2 code as shown below. You want the value of Z associated with the proportion you computed in Step 1 and that is what qnorm will give you.

 rsq.bc <- quantile(mtcar.boot.rsq$boot.rsq,
                       c(pnorm((2*pnorm(mean(mtcar.boot.rsq$boot.high))) - 1.96),
                         pnorm((2*pnorm(mean(mtcar.boot.rsq$boot.high))) + 1.96)))

becomes

 rsq.bc <- quantile(mtcar.boot.rsq$boot.rsq,
                       c(pnorm((2*qnorm(mean(mtcar.boot.rsq$boot.high))) - 1.96),
                         pnorm((2*qnorm(mean(mtcar.boot.rsq$boot.high))) + 1.96)))
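
For a self-contained version of the corrected steps, a rough sketch follows. The objects in the quoted code (mtcar.boot.rsq, boot.high) are not shown here, so this example makes its own assumptions: the model mpg ~ wt on the built-in mtcars data, 2000 resamples, the seed, and the usual convention that the Step 1 proportion is the share of bootstrap values below the original estimate.

```r
set.seed(3)

# R-squared of a simple regression on the built-in mtcars data
# (the model mpg ~ wt is an assumption made for this example)
rsq <- function(d) summary(lm(mpg ~ wt, data = d))$r.squared

original <- rsq(mtcars)
boot_rsq <- replicate(2000, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  rsq(mtcars[rows, ])
})

# Step 1: proportion of bootstrap values below the original estimate
p_below <- mean(boot_rsq < original)

# Step 2: qnorm turns that proportion into the bias-correction constant z0;
# pnorm then maps the shifted z values back to percentiles
z0    <- qnorm(p_below)
probs <- pnorm(2 * z0 + c(-1.96, 1.96))

rsq_bc   <- quantile(boot_rsq, probs)            # bias-corrected percentile CI
rsq_perc <- quantile(boot_rsq, c(0.025, 0.975))  # plain percentile CI
rbind(bias_corrected = unname(rsq_bc), percentile = unname(rsq_perc))
```

When p_below is exactly 0.5, z0 is 0 and the bias-corrected interval coincides with the plain percentile interval (up to the rounding in 1.96); otherwise the percentiles shift toward the side that offsets the bias.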

You might find this page helpful: http://influentialpoints.com/Training/bootstrap_confidence_intervals.htm#bias
