Solved – How to calculate confidence interval using parametric bootstrap

confidence interval, r

I was able to calculate the age-specific hazard ratio (HR) for breast cancer at age 40 and 60 years old. The HR was a continuous, piecewise linear function of age which was constant before 40 years, linear between ages 40-60 years, and constant after age 60 years.

I then calculated the age-specific cumulative risk from these HRs as 1-exp(-cumulative HR), and now want to calculate the corresponding confidence intervals. The paper "A PALB2 mutation associated with high risk of breast cancer" by Southey et al. (2010) calculated CIs using a parametric bootstrap with 5000 replications. The authors explained that

5000 draws were taken from the normal distribution that the parameter
estimates would be expected to follow under asymptotic likelihood
theory. For each age, corresponding values of the cumulative risk were
calculated, and the 95% CI was taken to be the 2.5 and 97.5 percentile
of this sample.

I want to do this in R. I think it has to do with the library boot. But I don't quite understand the statistics behind this. Can you please clarify these to me?

Thank you.

Additional info: Here is the code I used to calculate cumulative risk. As you can see, the cumulative risk (in column cumrr) at age 70 is 60.04%. I want to find the confidence interval of the cumrr column.

a = 3.467
b = 3.610

hr = rep(NA, 61)
inc = rep(NA, 61)
age = c(20:80)

for (i in 1:length(hr)) {
  if (age[i] < 30) 
  {
    hr[i] = exp(a)
    inc[i] = hr[i]*3.347955/100000
  }
  else if (age[i] >= 50) 
  {
    hr[i] = exp(b)
    inc[i] = hr[i]*279.0873/100000
  }
  else 
  {
    hr[i] = exp(a) + (age[i] - 30)*(exp(b)-exp(a))/20
    inc[i] = hr[i]*75.91559/100000
  }
}
cum.inc = cumsum(inc)
cum.risk = 1 - exp(-cum.inc)

I would like to note that sd(a) = 0.2668 and sd(b) = 0.3814. The correlation matrix between a and b is

R = matrix(cbind(1, -0.1080, -0.1080, 1), nrow = 2)

I did try to do the following bootstrap, but I think the code was doing a nonparametric bootstrap instead.

boot.sampling.dist <- matrix(1, 5000)

R = matrix(cbind(1, -0.1080, -0.1080, 1), nrow = 2)
U = R * (c(0.2668, 0.3814) %*% t(c(0.2668, 0.3814)))
library(mvtnorm)  # provides rmvnorm()
X <- rmvnorm(n=10000,mean=c(3.364,0.980),sigma=U) 

for (j in 1:5000){

  a = X[j, 1]
  b = X[j, 2]

  hr = rep(NA, 61)
  inc = rep(NA, 61)
  age = c(20:80)

  for (i in 1:length(hr)) {
    if (age[i] < 30) 
    {
      hr[i] = exp(a)
      inc[i] = hr[i]*3.347955/100000
    }
    else if (age[i] >= 50) 
    {
      hr[i] = exp(b)
      inc[i] = hr[i]*279.0873/100000
    }
    else 
    {
      hr[i] = exp(a) + (age[i] - 30)*(exp(b)-exp(a))/20
      inc[i] = hr[i]*75.91559/100000
    }
  }
  cum.inc = cumsum(inc)
  cum.risk = 1 - exp(-cum.inc)

  boot.sampling.dist[j] <- cum.risk[51]

}

my.quantiles<-quantile(boot.sampling.dist,c(.025,0.975))

Best Answer

Edit: Sorry about my confusion in the beginning... But I read over the article, which luckily is open access if anyone wants to chip in on this.

So I first have some general comments regarding the code that you provided, then some questions regarding the modelling.

In the article they state the following:

When testing for an age dependence of the HR, the model with a constant HR was compared to one where the HR was a continuous, piecewise linear function of age which was constant before age 40 years, linear between ages 40 and 60 years and constant after age 60 years.

So when I plot the hr column in your data frame I get the following:

plot of data

So, this is not continuous! So there is something that needs to be fixed here. I think that the code that you would want to use for that is the following:

hr = rep(NA, 51) #hazard ratio
age = c(20:70)
young <- 3.364*10*1.190/100000
old <- 0.980*10*279.0873/100000
for (i in 1:length(hr)) {
  if (age[i] < 30) hr[i] = young
  else if (age[i] >= 50) hr[i] = old
  else hr[i] = hr[i-1] + (1/20)*(old-young)
}

breast.hr = as.data.frame(cbind(age = 20:70, hr, 
                        cumhr= cumsum(hr), cumrr = 1 - exp(-cumsum(hr))))

plot(breast.hr$age,breast.hr$hr)
title("hr column")

That provides the following plot:

new plot

So now this is piecewise linear and continuous as stated in the article. Now the cumulative HR should look like the thick black line in Figure 3 of the article.

Now to continue I need to know exactly what the model is that they are using to be able to perform a parametric bootstrap. In the section Statistical Methods for Penetrance Analyses they state which model they use:

A mixed model was employed which incorporates an unmeasured polygenic factor to model the effect on breast cancer risk of a large number of unmeasured genes in addition to the measured major gene.[29] The polygenic part of this model was implemented via a hypergeometric polygenic model with four loci[30] and postulates a normally distributed random variable G for each person so that these variables are correlated within families (see section 8.9 of Lange et al.[31]). A woman's age at breast cancer diagnosis was modeled as a random variable whose HR was, for noncarriers, exp(G) times the Australian breast cancer incidence rate for 1992–2002[32] or, for carriers, the product of this HR multiplied by the age-specific HR. As in Antoniou et al.,[33] the variance of G was chosen to be 1.67, and the mean was chosen so that the average HR for noncarriers equaled the population incidence.

I think that we need more information about the data to be able to do a parametric bootstrap. If you think you can fully specify the model, then I can try to help you!

**EDIT:** So here I have changed the code to be able to plot the quantiles around the mean curve. You can see that it does not look exactly like Figure 3 in the paper, but that is probably because the inc variable is not continuous in age.

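library(MASS)  # provides mvrnorm()

# one row per bootstrap replication, one column per age (20 to 80)
boot.sampling.dist <- matrix(0,5000,61)

# covariance matrix U built from the correlation matrix R and the standard deviations
R = matrix(cbind(1, -0.1080, -0.1080, 1), nrow = 2)
U = R * (c(0.2668, 0.3814) %*% t(c(0.2668, 0.3814)))
# one draw of (a, b) per replication
X <- mvrnorm(n=5000,mu=c(3.364,0.980),Sigma=U) 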

for (j in 1:5000){

  a = X[j, 1]
  b = X[j, 2]

  hr = rep(NA, 61)
  inc = rep(NA, 61)
  age = c(20:80)

  for (i in 1:length(hr)) {
    if (age[i] < 30) 
    {
      hr[i] = exp(a)
      inc[i] = hr[i]*3.347955/100000
    }
    else if (age[i] >= 50) 
    {
      hr[i] = exp(b)
      inc[i] = hr[i]*279.0873/100000
    }
    else 
    {
      hr[i] = exp(a) + (age[i] - 30)*(exp(b)-exp(a))/20
      inc[i] = hr[i]*75.91559/100000
    }
  }
  cum.inc = cumsum(inc)
  cum.risk = 1 - exp(-cum.inc)

  boot.sampling.dist[j,] <- cum.risk

}

# pointwise (per-age) quantiles and mean across the bootstrap replications
my.025Quants <- apply(boot.sampling.dist,2,quantile,c(.025))
my.975Quants <- apply(boot.sampling.dist,2,quantile,c(.975))
my.mean <- apply(boot.sampling.dist,2,mean)

plot(age,my.mean,lwd=2,type="l",ylim=c(0,1))
lines(age,my.025Quants,col="red")
lines(age,my.975Quants,col="red")
title("Cumulative risk")

Plot with bounds
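As a compact alternative, the loop above can be wrapped in a function and applied to each simulated parameter pair. This is only a sketch under the same assumptions as the code above: the incidence rates and age cut-offs are copied from the question, the means `c(3.364, 0.980)` are the ones used in the bootstrap code, and `cum_risk` is a hypothetical helper name. The draws are generated with a Cholesky factor so that only base R is needed; `mvrnorm` would serve equally well.

```r
# Cumulative risk by age for log-HR parameters a and b.
# cum_risk is a hypothetical helper; rates and cut-offs are copied from the question.
cum_risk <- function(a, b, age = 20:80) {
  hr <- ifelse(age < 30, exp(a),
        ifelse(age >= 50, exp(b),
               exp(a) + (age - 30) * (exp(b) - exp(a)) / 20))
  rate <- ifelse(age < 30, 3.347955,
          ifelse(age >= 50, 279.0873, 75.91559)) / 100000
  1 - exp(-cumsum(hr * rate))
}

set.seed(1)
age <- 20:80
mu  <- c(3.364, 0.980)                 # assumed point estimates (from the code above)
sds <- c(0.2668, 0.3814)               # standard errors given in the question
R   <- matrix(c(1, -0.1080, -0.1080, 1), nrow = 2)
U   <- R * (sds %o% sds)               # covariance matrix

# Parametric bootstrap: draw (a, b) from N(mu, U) via the Cholesky factor of U
n.rep <- 5000
Z <- matrix(rnorm(2 * n.rep), ncol = 2)
X <- sweep(Z %*% chol(U), 2, mu, "+")

# One cumulative-risk curve per draw, then pointwise percentiles
risk <- t(apply(X, 1, function(p) cum_risk(p[1], p[2])))   # n.rep x 61
ci   <- apply(risk, 2, quantile, probs = c(0.025, 0.975))  # 2 x 61

ci[, age == 70]   # 95% percentile interval for the cumulative risk at age 70
```

Each column of `ci` holds the 2.5 and 97.5 percentiles for one age, which is the construction described in the quoted passage from the paper.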

Related Solutions

Bootstrap Statistics – How to Calculate Bootstrap-Based Confidence Intervals

The question is related to the fundamental construction of confidence intervals, and when it comes to bootstrapping, the answer depends upon which bootstrapping method is used.

Consider the following setup: $\hat{\theta}$ is an estimator of a real valued parameter $\theta$ with (an estimated) standard deviation $\text{se}$, then a standard 95% confidence interval based on a normal $N(\theta, \text{se}^2)$ approximation is $$\hat{\theta} \pm 1.96 \text{se}.$$ This confidence interval is derived as the set of $\theta$'s that fulfill $$z_{1} \leq \hat{\theta} - \theta \leq z_2$$ where $z_1 = -1.96\text{se}$ is the 2.5% quantile and $z_2 = 1.96\text{se}$ is the 97.5% quantile for the $N(0, \text{se}^2)$-distribution. The interesting observation is that when rearranging the inequalities we get the confidence interval expressed as $$\{\theta \mid \hat{\theta} - z_2 \leq \theta \leq \hat{\theta} - z_1 \} = [\hat{\theta} - z_2, \hat{\theta} - z_1].$$ That is, it is the lower 2.5% quantile that determines the right end point and the upper 97.5% quantile that determines the left end point.
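As a small numerical check of this rearrangement, the sketch below uses illustrative values for $\hat{\theta}$ and $\text{se}$ (they are assumptions, not estimates from any data) and confirms that $[\hat{\theta} - z_2, \hat{\theta} - z_1]$ coincides with $\hat{\theta} \pm 1.96\,\text{se}$.

```r
theta_hat <- 2.0   # illustrative estimate (assumption)
se        <- 0.5   # illustrative standard error (assumption)

# 2.5% and 97.5% quantiles of the N(0, se^2) distribution
z1 <- qnorm(0.025, mean = 0, sd = se)
z2 <- qnorm(0.975, mean = 0, sd = se)

# Interval obtained by rearranging z1 <= theta_hat - theta <= z2
ci_rearranged <- c(theta_hat - z2, theta_hat - z1)

# Textbook form: theta_hat +/- 1.96 * se
ci_standard <- theta_hat + c(-1, 1) * qnorm(0.975) * se

rbind(ci_rearranged, ci_standard)
```

Because the normal distribution is symmetric, the two forms agree; the distinction only starts to matter once the quantiles are estimated from a skewed sampling distribution.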

If the sampling distribution of $\hat{\theta}$ is right skewed compared to the normal approximation, what is then the appropriate action? If right-skewed means that the 97.5% quantile for the sampling distribution is $z_2 > 1.96\text{se}$, the appropriate action is to move the left end point further to the left. That is, if we stick to the standard construction above. A standard usage of the bootstrap is to estimate the sampling quantiles and then use them instead of $\pm 1.96 \text{se}$ in the construction above.

However, another standard construction used in bootstrapping is the percentile interval, which is $$[\hat{\theta} + z_1, \hat{\theta} + z_2]$$ in the terminology above. It is simply the interval from the 2.5% quantile to the 97.5% quantile for the sampling distribution of $\hat{\theta}$. A right-skewed sampling distribution of $\hat{\theta}$ implies a right-skewed confidence interval. For the reasons mentioned above, this appears to me to be a counter-intuitive behavior of percentile intervals. But they have other virtues, and are, for instance, invariant under monotone parameter transformations.
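To make the difference between the two constructions concrete, here is a sketch that computes both intervals from the same bootstrap quantiles. The data are simulated from an exponential distribution purely for illustration, and the nonparametric resampling of the mean is an assumed example, not something taken from the question.

```r
set.seed(42)
x <- rexp(30)                 # illustrative right-skewed sample (assumption)
theta_hat <- mean(x)

# Bootstrap replicates of the estimator
theta_star <- replicate(2000, mean(sample(x, replace = TRUE)))

# z1, z2: 2.5% and 97.5% quantiles of the bootstrap distribution of theta_hat* - theta_hat
z <- quantile(theta_star - theta_hat, c(0.025, 0.975), names = FALSE)

basic_ci      <- c(theta_hat - z[2], theta_hat - z[1])  # [theta_hat - z2, theta_hat - z1]
percentile_ci <- c(theta_hat + z[1], theta_hat + z[2])  # [theta_hat + z1, theta_hat + z2]

rbind(basic_ci, percentile_ci)
```

Both intervals have the same width; they are mirror images of each other around $\hat{\theta}$, so any skewness in the bootstrap distribution pushes them in opposite directions.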

The BCa (bias-corrected and accelerated) bootstrap intervals as introduced by Efron, see e.g. the paper Bootstrap Confidence Intervals, improve upon the properties of percentile intervals. I can only guess (and google) the quote the OP posted, but maybe BCa is the appropriate context. Citing DiCiccio and Efron from the paper mentioned, page 193,

The following argument motivates the BCa definition (2.3), as well as the parameters $a$ and $z_0$. Suppose that there exists a monotone increasing transformation $\phi = m(\theta)$ such that $\hat{\phi} = m(\hat{\theta})$ is normally distributed for every choice of $\theta$, but possibly with a bias and a nonconstant variance, $$\hat{\phi} \sim N(\phi - z_0 \sigma_{\phi}, \sigma_{\phi}^2), \quad \sigma_{\phi} = 1 + a \phi.$$ Then (2.3) gives exactly accurate and correct confidence limits for $\theta$ having observed $\hat{\theta}$.

where (2.3) is the definition of the BCa intervals. The quote posted by the OP may refer to the fact that BCa can shift confidence intervals with a right-skewed sampling distribution further to the right. It is difficult to tell if this is the "correct action" in a general sense, but according to Diciccio and Efron it is correct in the setup above in the sense of producing confidence intervals with the correct coverage. The existence of the monotone transformation $m$ is a little tricky, though.
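In R, the three interval types discussed here can be compared with `boot.ci` from the boot package. The sketch below uses simulated exponential data and the sample mean as the statistic; both are assumptions chosen for illustration, and `mean_stat` is a hypothetical helper name.

```r
library(boot)  # boot(), boot.ci()

set.seed(42)
x <- rexp(30)  # illustrative right-skewed sample (assumption)

# boot() expects a statistic of the form function(data, indices)
mean_stat <- function(data, idx) mean(data[idx])

b  <- boot(x, statistic = mean_stat, R = 2000)
ci <- boot.ci(b, conf = 0.95, type = c("basic", "perc", "bca"))

# Columns 4 and 5 of each component hold the lower and upper end points
rbind(basic      = ci$basic[4:5],
      percentile = ci$percent[4:5],
      BCa        = ci$bca[4:5])
```

Comparing the rows gives a quick impression of how far the BCa adjustment moves the end points relative to the percentile interval for a given data set.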

Solved – Using R to calculate survival probabilities with time-varying covariates using an Andersen-Gill model

My line of thought goes like this. As per the AG model, each individual is represented by a counting process $N_i(t)$ with intensity $\lambda_i(t)$ which can be written as $\lambda_i(t) = \lambda_0(t) \exp(\beta' x_i(t))$. By this it is implicit that only the "current" value of $x_i$ matters for the intensity. The counting process grows when an individual has an event. I also assume independent censoring (actually stopping of the process). Note that these assumptions are implicit from the way that you fitted the model.

The probability of no events in $(s,t)$ (improperly, a "survival") is then $$ S_i(t|s) = \exp \left( -\int_s^t \lambda_0(u) \exp(\beta' x_i(u)) du \right) $$ Note that the probability of no events does not directly relate to the probability of one event, as the complementary event to "no events in a period" is "at least one event during a period".

Then say you are interested in $S(t|s_0)$. Then the idea would be to fix $x_i = x_i(s_0)$ (i.e. assume the covariates don't change after $s_0$), and we would get $$ S_i(t|s_0) = S_i(t) / S_i(s_0) $$

Since the part before $s_0$ cancels out, and given my assumption that only the current value of $x_i$ matters for the intensity, this is equal to $$ S_i(t|s_0) = \exp \left( -\exp(\beta'x_i(s_0)) \int_{s_0}^t \lambda_0(u)du \right) $$

Then in R you can do it like this (I give an example with a data set):

library(survival)     # coxph(), Surv(), survfit()
library(frailtypack)  # provides the readmission data set
data(readmission)

mod1 <- coxph(Surv(t.start, t.stop, event) ~ sex + charlson + cluster(id), data = readmission)

# set the covariate values at s_0
mycov <- data.frame(sex = "Female", charlson = "1-2")
sf <- survfit(mod1, newdata = mycov)

# with different s_0
par(mfrow=c(2,2))
s0.values <- c(300, 500, 700, 1000)
for(s0 in s0.values) {
  # index of the last event time at or before s_0
  pos <- which.max(sf$time[sf$time <= s0])
  S_s0 <- sf$surv[pos]

  # plot S(t | s_0) = S(t) / S(s_0) for t >= s_0
  with(sf, plot(time[pos:length(time)], surv[pos:length(surv)] / S_s0, type = "l"))
}

Here you get the plots of the survival curves, which correspond to the values $(t, S(t|s_0))$ for $t\geq s_0$.
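The cancellation used in the derivation can also be checked numerically. The sketch below assumes a hypothetical baseline intensity `lambda0` and a hypothetical fixed linear predictor `lp` (standing in for $\beta' x_i(s_0)$); neither comes from the fitted model above.

```r
lambda0 <- function(u) 1e-6 * u   # hypothetical baseline intensity
lp      <- 0.4                    # hypothetical value of beta' x_i(s_0)

# Probability of no events in (0, t) with the covariates held fixed
S <- function(t) exp(-exp(lp) * integrate(lambda0, 0, t)$value)

s0 <- 300
t1 <- 700

# Ratio form: S(t) / S(s_0)
cond_ratio <- S(t1) / S(s0)

# Direct form: integrate the baseline intensity from s_0 to t only
cond_direct <- exp(-exp(lp) * integrate(lambda0, s0, t1)$value)

c(cond_ratio = cond_ratio, cond_direct = cond_direct)
```

The two numbers agree up to numerical integration error, which is why the loop above can simply divide the fitted survival curve by its value at $s_0$.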

The other comments that I would give on this are the following: it is difficult to talk about the distribution of the next event. This is because in the AG formulation the time since previous event does not play any role. In other words, if you would like to take that into account, more complicated stochastic models should be used, where for example you include the "previous number of events" as a time-dependent covariate. This does complicate things a lot and the interpretation of the estimated quantities is most likely very difficult.

The second comment I have is about the nature of the time-dependent covariates. Mostly, the AG model works nicely with "external" covariates, such as air pollution, or something which is not directly measured on the subject ("external" to the recurrent event process). This is mostly because the first expression I wrote here, the probability of no events during $(s,t)$, relies on the assumption that the number of events in any given interval is Poisson distributed. This is true if the covariates are external. A discussion on this can be found in several textbooks, for example in Cook & Lawless, section 2.5. If your time-dependent covariates do depend on the recurrent event process, then they should be modeled jointly with the recurrent event process.
