Solved – How to calculate confidence interval using parametric bootstrap

confidence intervalr

I was able to calculate the age-specific hazard ratio (HR) for breast cancer at age 40 and 60 years old. The HR was a continuous, piecewise linear function of age which was constant before 40 years, linear between ages 40-60 years, and constant after age 60 years.

I then calculates the age-specific cumulative risk from these HRs as 1-exp(-cumulative HR), and now want to calculate the corresponding confidence intervals. A paper "A PALB2 mutation associated with high risk of breast cancer" by Southey et al (2010) calculated CI by using a parametric bootstrap with 5000 replications. She explained that

5000 draws were taken from the normal distribution that the parameter
estimates would be expected to follow under asymptotic likelihood
theory. For each age, corresponding values of the cumulative risk were
calculated, and the 95% CI was taken to be the 2.5 and 97.5 percentile
of this sample.

I want to do this in R. I think it has to do with the library boot. But I don't quite understand the statistics behind this. Can you please clarify these to me?

Thank you.

Additional info: Here is the code i used to calculate cumulative risk. As you can see, the cumulative risk (at column cumrr) at age 70 is 60.04%. I want to find the confidence interval of the cumrr column.

a = 3.467
b = 3.610

hr = rep(NA, 61)
inc = rep(NA, 61)
age = c(20:80)

for (i in 1:length(hr)) {
  if (age[i] < 30) 
  {
    hr[i] = exp(a)
    inc[i] = hr[i]*3.347955/100000
  }
  else if (age[i] >= 50) 
  {
    hr[i] = exp(b)
    inc[i] = hr[i]*279.0873/100000
  }
  else 
  {
    hr[i] = exp(a) + (age[i] - 30)*(exp(b)-exp(a))/20
    inc[i] = hr[i]*75.91559/100000
  }
}
cum.inc = cumsum(inc)
cum.risk = 1 - exp(-cum.inc)

I would like to note that sd(a) = 0.2668 and sd(b) = 0.3814. The correlation matrix between a and b is

R = matrix(cbind(1, -0.1080, -0.1080, 1), nrow = 2)

I did try to do the following bootstrap, but I think the code was doing a nonparametric bootstrap instead.

boot.sampling.dist <- matrix(1, 5000)

R = matrix(cbind(1, -0.1080, -0.1080, 1), nrow = 2)
U = R * (c(0.2668, 0.3814) %*% t(c(0.2668, 0.3814)))
X <- rmvnorm(n=10000,mean=c(3.364,0.980),sigma=U) 

for (j in 1:5000){

  a = X[j, 1]
  b = X[j, 2]

  hr = rep(NA, 61)
  inc = rep(NA, 61)
  age = c(20:80)

  for (i in 1:length(hr)) {
    if (age[i] < 30) 
    {
      hr[i] = exp(a)
      inc[i] = hr[i]*3.347955/100000
    }
    else if (age[i] >= 50) 
    {
      hr[i] = exp(b)
      inc[i] = hr[i]*279.0873/100000
    }
    else 
    {
      hr[i] = exp(a) + (age[i] - 30)*(exp(b)-exp(a))/20
      inc[i] = hr[i]*75.91559/100000
    }
  }
  cum.inc = cumsum(inc)
  cum.risk = 1 - exp(-cum.inc)

  boot.sampling.dist[j] <- cum.risk[51]

}

my.quantiles<-quantile(boot.sampling.dist,c(.025,0.975))

Best Answer

Edit: Sorry about my confusion in the beginning... But I read over the article, which luckily is in open access if anyone want to chip in on this.

So I first have some general comments regarding the code that you provided, then some questions regarding the modelling.

In the article they state the following:

When testing for an age dependence of the HR, the model with a constant HR was compared to one where the HR was a continuous, piecewise linear function of age which was constant before age 40 years, linear between ages 40 and 60 years and constant after age 60 years.

So when I plot the hr column in your data frame I get the following:

plot of data

So, this is not continuous! So there is something that needs to be fixed here. I think that the code that you would want to use for that is the following:

hr = rep(NA, 51) #hazard ratio
age = c(20:70)
young <- 3.364*10*1.190/100000
old <- 0.980*10*279.0873/100000
for (i in 1:length(hr)) {
  if (age[i] < 30) hr[i] = young
  else if (age[i] >= 50) hr[i] = old
  else hr[i] = hr[i-1] + (1/20)*(old-young)
}

breast.hr = as.data.frame(cbind(age = 20:70, hr, 
                        cumhr= cumsum(hr), cumrr = 1 - exp(-cumsum(hr))))

plot(breast.hr$age,breast.hr$hr)
title("hr column")

That provides the following plot:

new plot

So now this is piecewise linear and continuous as stated in the article. Now the cumulative HR should look like the thick black line in Figure 3 of the article.

Now to continue I need to know exactly what the model is that they are using to be able to perform a parametric bootstrap. In the section Statistical Methods for Penetrance Analyses they state which model they use:

A mixed model was employed which incorporates an unmeasured polygenic factor to model the effect on breast cancer risk of a large number of unmeasured genes in addition to the measured major gene.[29] The polygenic part of this model was implemented via a hypergeometric polygenic model with four loci[30] and postulates a normally distributed random variable G for each person so that these variables are correlated within families (see section 8.9 of Lange et al.[31]). A woman's age at breast cancer diagnosis was modeled as a random variable whose HR was, for noncarriers, exp(G) times the Australian breast cancer incidence rate for 1992–2002[32] or, for carriers, the product of this HR multiplied by the age-specific HR. As in Antoniou et al.,[33] the variance of G was chosen to be 1.67, and the mean was chosen so that the average HR for noncarriers equaled the population incidence.

I think that we need more information about the data to be able to do a parametric bootstrap. If you think can fully specify the model, then I can try to help you!

**EDIT: ** So here I have changed the code to be able to plot the quantiles around the mean curve. You can see that it does not look exactly like Figure 3 in the paper, but that is probably because the inc variable is not continuous in age.

boot.sampling.dist <- matrix(0,5000,61)

R = matrix(cbind(1, -0.1080, -0.1080, 1), nrow = 2)
U = R * (c(0.2668, 0.3814) %*% t(c(0.2668, 0.3814)))
X <- mvrnorm(n=10000,mu=c(3.364,0.980),Sigma=U) 

for (j in 1:5000){

  a = X[j, 1]
  b = X[j, 2]

  hr = rep(NA, 61)
  inc = rep(NA, 61)
  age = c(20:80)

  for (i in 1:length(hr)) {
    if (age[i] < 30) 
    {
      hr[i] = exp(a)
      inc[i] = hr[i]*3.347955/100000
    }
    else if (age[i] >= 50) 
    {
      hr[i] = exp(b)
      inc[i] = hr[i]*279.0873/100000
    }
    else 
    {
      hr[i] = exp(a) + (age[i] - 30)*(exp(b)-exp(a))/20
      inc[i] = hr[i]*75.91559/100000
    }
  }
  cum.inc = cumsum(inc)
  cum.risk = 1 - exp(-cum.inc)

  boot.sampling.dist[j,] <- cum.risk

}

my.quantiles<-quantile(boot.sampling.dist,c(.025,0.975))

my.025Quants <- apply(boot.sampling.dist,2,quantile,c(.025))
my.975Quants <- apply(boot.sampling.dist,2,quantile,c(.975))
my.mean <- apply(boot.sampling.dist,2,mean)

plot(age,my.mean,lwd=2,type="l",ylim=c(0,1))
lines(age,my.025Quants,col="red")
lines(age,my.975Quants,col="red")
title("Cumulative Hazard Ratio")

Plot with bounds

Related Question