How does the boot package in R handle collecting bootstrap samples if strata are not specified but the function separates the dataset by strata?

bootstrap, r, stratification

My current understanding is:
1) If strata are not specified, boot randomly selects rows with replacement from the entire dataset. If the dataset is actually stratified, boot would often return uneven sample sizes across the strata.
2) If strata are specified, boot randomly selects rows with replacement within each stratum, independently of the other strata. In this case boot would always return the same sample size for each stratum.

My statistic is the ratio of the means from two different groups and thus I set up my boot call with strata, which I believe then collects bootstrap samples from within each group before running my function. However, to double check my understanding I ran boot with and without strata and was surprised to find the results very similar.  Am I misunderstanding how boot handles strata? Or would these approaches differ only rarely, such as when boot randomly collects a very uneven set of bootstrap samples?

Any explanation of why these two approaches produce similar results is appreciated.

Below is R code for a sample dataset in which the true value of my statistic is 10. I've run boot with and without strata specified. I find the estimated confidence intervals are slightly narrower when the strata are specified, but I expected a more dramatic difference.

    library(boot)
    # relative yield takes a matrix or dataframe and finds the ratio
    # of the means: treatmentMean/controlMean. 
    # data structure:
    # first column is strata, control = 1 and treatment = 2
    # second column is response, or the data to be bootstrapped
    rel.yield <- function(D,i) {
      trt <- D[i,1]
      resp <- D[i,2]
      mean(resp[trt==2]) / mean(resp[trt==1])
    }

    # some data that has a true rel.yield of 10
    sub.pop <- matrix(data = c(rep(1,15),rep(2,15),rnorm(15,2,1),rnorm(15,20,1)),
                      nrow = 30, ncol = 2, dimnames = list((1:30),c('trt','resp')))

    # with strata specified
    b.strat <- boot(sub.pop, rel.yield, R = 1000, strata = sub.pop[,1])
    # without strata specified (named b.plain to avoid masking base::c)
    b.plain <- boot(sub.pop, rel.yield, R = 1000)

    # note the distributions of t* are similar
    par(mfrow=c(1,2))
    hist(b.strat$t)
    abline(v=mean(b.strat$t))
    hist(b.plain$t)
    abline(v=mean(b.plain$t))

    # and the CI estimates are also similar
    boot.ci(b.strat)
    boot.ci(b.plain)

Best Answer

Your understanding in both 1) and 2) is correct: the strata option draws bootstrap samples within each stratum independently, whereas without strata boot resamples from all the data, so the bootstrap samples will generally not contain the same number of observations from each stratum.
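One way to check this directly is to look at the resampled row indices rather than at the statistic. The sketch below assumes the same data layout as in the question (15 control rows followed by 15 treatment rows); `dat`, `b.strat`, `b.plain` and the helper `n.control` are names introduced here for illustration, and the seed is arbitrary. It uses `boot.array(..., indices = TRUE)`, which returns an R x n matrix of the row indices used in each replicate.

```r
library(boot)

set.seed(1)  # arbitrary seed, only so the run is reproducible

# Same layout as the question: column 1 is the group, column 2 the response
dat <- cbind(trt  = rep(1:2, each = 15),
             resp = c(rnorm(15, 2, 1), rnorm(15, 20, 1)))

rel.yield <- function(D, i) {
  trt  <- D[i, 1]
  resp <- D[i, 2]
  mean(resp[trt == 2]) / mean(resp[trt == 1])
}

b.strat <- boot(dat, rel.yield, R = 1000, strata = dat[, 1])
b.plain <- boot(dat, rel.yield, R = 1000)

# Hypothetical helper: number of control rows (row index <= 15) per replicate
n.control <- function(b) rowSums(boot.array(b, indices = TRUE) <= 15)

range(n.control(b.strat))  # stratified: the control count is fixed at 15
range(n.control(b.plain))  # unstratified: the control count varies
```

With strata, every replicate should contain exactly 15 control rows; without strata, the count fluctuates around 15 from one replicate to the next.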

As to why they are similar: it is just a matter of sample size. When running the above example (seed 42 and 10,000 bootstrap replicates), I obtain

  • Standard error of stratified bootstrap estimator: 0.8502739
  • Standard error of unstratified bootstrap estimator: 0.8691592

After reducing the sample size to 10 (from 30), I obtain (with seed 42)

  • Standard error of stratified bootstrap estimator: 1.032356
  • Standard error of unstratified bootstrap estimator: 1.131476
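A comparison along these lines could be scripted roughly as follows. This is only a sketch: `compare.se` is a hypothetical helper, it assumes the sample size is split equally between the two groups, and because the exact data generation and call order behind the figures above are not shown, it should not be expected to reproduce those numbers.

```r
library(boot)

rel.yield <- function(D, i) {
  trt  <- D[i, 1]
  resp <- D[i, 2]
  mean(resp[trt == 2]) / mean(resp[trt == 1])
}

# Hypothetical helper: bootstrap standard errors with and without strata
# for a dataset with n.per.group rows in each of the two groups
compare.se <- function(n.per.group, R = 2000, seed = 42) {
  set.seed(seed)
  dat <- cbind(trt  = rep(1:2, each = n.per.group),
               resp = c(rnorm(n.per.group, 2, 1), rnorm(n.per.group, 20, 1)))
  b.strat <- boot(dat, rel.yield, R = R, strata = dat[, 1])
  b.plain <- boot(dat, rel.yield, R = R)
  # na.rm guards against the rare unstratified replicate with an empty group
  c(stratified   = sd(b.strat$t, na.rm = TRUE),
    unstratified = sd(b.plain$t, na.rm = TRUE))
}

compare.se(15)  # 30 rows in total, as in the question
compare.se(5)   # 10 rows in total
```

The `na.rm = TRUE` is there because, without strata, a small dataset can occasionally yield a replicate with no rows from one group, in which case the ratio is `NaN`.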