Solved – Calculate accelerated bootstrap interval in R

bootstrapconfidence intervalr

I am trying to to calculate bootstrap confidence interval on an index calculated from a vector of values, and if the index is significantly greater than 0 in R.

For example, the vector of length 6: (0,0, 100, 30, 200,6).
And I calculate the index with:

J = (var(vector)/mean(vector)^2) - (1/mean(vector))

I am trying to use a method of accelerated bootstrap from another paper that has done it in SAS, but I don't know what the R equivalent is? I dabbled with using boot.ci, but I wasn't sure how to specify it and if it was correct.

The bit from the paper I was referring to reads:

"We used the accelerated bootstrap (Dixon 2001, SAS) to estimate 95%
confidence intervals for all aggregation indices and to test whether
the parameter estimated by the index J differed significantly from 0
at alpha = 0.05"

Best Answer

First a warning... the Bootstrap (as with most statistical methods) is unlikely to be reliable with such a small sample size. I would exercise caution if $n=6$ is a standard sample size in your case.

Lets simulate some data

set.seed(42)
n <- 30 #Sample size
x <- round(runif(n, 0, 100))

Lets refer to your index as $\theta$ and the estimator you provide for it as $\hat\theta$, which can be computed as follows.

theta_hat <- var(x)/mean(x)^2 - 1/mean(x)

For this simulated data, I get $\hat\theta = 0.2104$ and (by cranking $n$ wayyyy up) we have (roughly) $\theta = 0.32$.

Obtain the Bootstrap distribution

The Bootstrap algorithm is fairly straightforward to code up on your own.

B <- 10000 #number of bootstrap resamples
theta_boot <- rep(NA, B)
for(i in 1:B){
  #Select a bootstrap sample
  xnew <- sample(x, length(x), replace=TRUE)
  #Estimate index
  theta_boot[i] <- var(xnew)/mean(xnew)^2 - 1/mean(xnew)
}

#Plot bootstrap distribution
hist(theta_boot, breaks=30, xlab='theta', main='Bootstrap distribution')
abline(v=0.32, lwd=2, col='orange')

The resulting distribution looks like this, where the vertical line represents the "true" value of the index $\theta$.

Confidence intervals using the (percentile) Bootstrap

At this point, getting a confidence interval is very straightforward. Suppose you want a $95\%$ CI (i.e. $\alpha = 0.05$). You are looking for the points $L$ and $U$ such that $2.5\%$ of the Bootstrap samples are below $L$ and above $U$.

Mathematically, this is equivalent to setting $$L = \hat F^{-1}(\alpha/2) \quad\quad\quad U = \hat F^{-1}(1-\alpha/2),$$ where $\hat F$ is the "Bootstrap CDF". In R, this can be done simply by typing

alpha <- 0.05
quantile(theta_boot, c(alpha/2, 1-alpha/2))

For this data, we get a $95\%$ CI of $(0.101, 0.355)$.

The Accelerated Bootstrap

Although the method of the previous section is a straightforward and natural way to obtain endpoints for a confidence interval, there are several alternatives which have been shown to perform better in a variety of settings. The Accelerated Bootstrap is one such method.

The endpoints to the CI in this approach are found by considering the function $$g(u) = \hat F^{-1}\left(\Phi\left(z_0 + \frac{z_0 + z_u}{1-a(z_0+z_u)}\right) \right)$$ and setting $L = g(\alpha/2)$ and $U=g(1-\alpha/2)$. There are a lot of new terms in this function which I will now describe.

$\Phi(z)$ represents the standard normal CDF.
$z_0 = \Phi^{-1}(\hat F(\hat\theta)).$
$z_u = \Phi^{-1}(u).$
$a$ is an "acceleration constant".

Estimation of the acceleration constant is the last remaining "challenge" and will be discussed in the next section. For now, let's fix the value $a=0.046$. The accelerate Bootstrap CI can now be computed in R as follows.

#Desired quantiles
u <- c(alpha/2, 1-alpha/2) 

#Compute constants
z0 <- qnorm(mean(theta_boot <= theta_hat))
zu <- qnorm(u)
a <- 0.046 

#Adjusted quantiles
u_adjusted <- pnorm(z0 + (z0+zu)/(1-a*(z0+zu))) 

#Accelerated Bootstrap CI
quantile(theta_boot, u_adjusted)

This gives a new $95\%$ CI of $(0.114, 0.383)$, which has effectively "shifted" the CI bounds in the direction of the true value for $\theta$. (Side note: when $a=0$, the accelerated Bootstrap is known as the bias correction Bootstrap).

The following figure shows the Bootstrap distribution again, with vertical lines representing the Confidence intervals for each case.

Estimating the acceleration constant

The acceleration constant can (in some cases) be calculated theoretically from the data by assuming a particular distribution for the data. Otherwise, a non-parametric approach can be used.

Efron (1987) shows that for univariate sampling distributions, the acceleration constant is reasonably well approximated by $$\hat a = \frac{1}{6}\frac{\sum_{i=1}^n I_i^3}{\left(\sum_{i=1}^nI_i^2\right)^{3/2}}$$ where $I_i$ denotes the influence of point $x_i$ on the estimation of $\theta$. Efron proposes approximating $I_i$ using the infinitesimal jackknife, but others have demonstrated that the finite-sample Jackknife is often sufficient. Thus, each $I_i$ can be approximated by $$I_i = (n-1)[\hat\theta - \hat\theta_{-i}]$$ where $\hat\theta_{-i}$ represents an estimate of $\theta$ (your index) after removing the $i^{th}$ data point.

I <- rep(NA, n)
for(i in 1:n){
   #Remove ith data point
   xnew <- x[-i]
   #Estimate theta
   theta_jack <- var(xnew)/mean(xnew)^2 - 1/mean(xnew)
   I[i] <- (n-1)*(theta_hat - theta_jack)
}
#Estimate a
a_hat <- (sum(I^3)/sum(I^2)^1.5)/6

This leads to the accleration constant estimate of $\hat a = 0.046$ that was used in the previous section.

Related Solutions

Solved – Using Bootstrap to estimate confidence interval of the standard deviation

I would say that since sample mean is $\overline{X}=\frac{\sum{X_i}}{n}$, its convergence rate is 1/n, which is also the convergence rate of sample variance. But the convergence rate of sample standard deviation $ 1/\sqrt{n} $.

Solved – How to calculate confidence interval using parametric bootstrap

Edit: Sorry about my confusion in the beginning... But I read over the article, which luckily is in open access if anyone want to chip in on this.

So I first have some general comments regarding the code that you provided, then some questions regarding the modelling.

In the article they state the following:

When testing for an age dependence of the HR, the model with a constant HR was compared to one where the HR was a continuous, piecewise linear function of age which was constant before age 40 years, linear between ages 40 and 60 years and constant after age 60 years.

So when I plot the hr column in your data frame I get the following:

plot of data

So, this is not continuous! So there is something that needs to be fixed here. I think that the code that you would want to use for that is the following:

hr = rep(NA, 51) #hazard ratio
age = c(20:70)
young <- 3.364*10*1.190/100000
old <- 0.980*10*279.0873/100000
for (i in 1:length(hr)) {
  if (age[i] < 30) hr[i] = young
  else if (age[i] >= 50) hr[i] = old
  else hr[i] = hr[i-1] + (1/20)*(old-young)
}

breast.hr = as.data.frame(cbind(age = 20:70, hr, 
                        cumhr= cumsum(hr), cumrr = 1 - exp(-cumsum(hr))))

plot(breast.hr$age,breast.hr$hr)
title("hr column")

That provides the following plot:

new plot

So now this is piecewise linear and continuous as stated in the article. Now the cumulative HR should look like the thick black line in Figure 3 of the article.

Now to continue I need to know exactly what the model is that they are using to be able to perform a parametric bootstrap. In the section Statistical Methods for Penetrance Analyses they state which model they use:

A mixed model was employed which incorporates an unmeasured polygenic factor to model the effect on breast cancer risk of a large number of unmeasured genes in addition to the measured major gene.[29] The polygenic part of this model was implemented via a hypergeometric polygenic model with four loci[30] and postulates a normally distributed random variable G for each person so that these variables are correlated within families (see section 8.9 of Lange et al.[31]). A woman's age at breast cancer diagnosis was modeled as a random variable whose HR was, for noncarriers, exp(G) times the Australian breast cancer incidence rate for 1992–2002[32] or, for carriers, the product of this HR multiplied by the age-specific HR. As in Antoniou et al.,[33] the variance of G was chosen to be 1.67, and the mean was chosen so that the average HR for noncarriers equaled the population incidence.

I think that we need more information about the data to be able to do a parametric bootstrap. If you think can fully specify the model, then I can try to help you!

**EDIT: ** So here I have changed the code to be able to plot the quantiles around the mean curve. You can see that it does not look exactly like Figure 3 in the paper, but that is probably because the inc variable is not continuous in age.

boot.sampling.dist <- matrix(0,5000,61)

R = matrix(cbind(1, -0.1080, -0.1080, 1), nrow = 2)
U = R * (c(0.2668, 0.3814) %*% t(c(0.2668, 0.3814)))
X <- mvrnorm(n=10000,mu=c(3.364,0.980),Sigma=U) 

for (j in 1:5000){

  a = X[j, 1]
  b = X[j, 2]

  hr = rep(NA, 61)
  inc = rep(NA, 61)
  age = c(20:80)

  for (i in 1:length(hr)) {
    if (age[i] < 30) 
    {
      hr[i] = exp(a)
      inc[i] = hr[i]*3.347955/100000
    }
    else if (age[i] >= 50) 
    {
      hr[i] = exp(b)
      inc[i] = hr[i]*279.0873/100000
    }
    else 
    {
      hr[i] = exp(a) + (age[i] - 30)*(exp(b)-exp(a))/20
      inc[i] = hr[i]*75.91559/100000
    }
  }
  cum.inc = cumsum(inc)
  cum.risk = 1 - exp(-cum.inc)

  boot.sampling.dist[j,] <- cum.risk

}

my.quantiles<-quantile(boot.sampling.dist,c(.025,0.975))

my.025Quants <- apply(boot.sampling.dist,2,quantile,c(.025))
my.975Quants <- apply(boot.sampling.dist,2,quantile,c(.975))
my.mean <- apply(boot.sampling.dist,2,mean)

plot(age,my.mean,lwd=2,type="l",ylim=c(0,1))
lines(age,my.025Quants,col="red")
lines(age,my.975Quants,col="red")
title("Cumulative Hazard Ratio")

Plot with bounds