Solved – How does the boot package in R handle collecting bootstrap samples if strata are not specified but the function separates the dataset by strata

bootstrap, r, stratification

My current understanding is:
1) If strata are not specified, boot randomly selects rows with replacement from the entire dataset. If the dataset is actually stratified, boot would often return uneven sample sizes per stratum.
2) If strata are specified, boot randomly selects rows with replacement within each stratum, independently of the other strata. In this case boot would always return the same sample size for each stratum.

My statistic is the ratio of the means from two different groups and thus I set up my boot call with strata, which I believe then collects bootstrap samples from within each group before running my function. However, to double check my understanding I ran boot with and without strata and was surprised to find the results very similar. Am I misunderstanding how boot handles strata? Or would these approaches differ only rarely, such as when boot randomly collects a very uneven set of bootstrap samples?

Any explanation of why these two approaches produce similar results is appreciated.

Below is R code for a sample dataset in which the true value of my statistic is 10. I've run boot with and without strata specified. The estimated confidence intervals are slightly narrower when the strata are specified, but I expected a more dramatic difference.

    library(boot)
    # relative yield takes a matrix or dataframe and finds the ratio
    # of the means: treatmentMean/controlMean. 
    # data structure:
    # first column is strata, control = 1 and treatment = 2
    # second column is response, or the data to be bootstrapped
    rel.yield <- function(D,i) {
      trt <- D[i,1]
      resp <- D[i,2]
      mean(resp[trt==2]) / mean(resp[trt==1])
    }

    # some data that has a true rel.yield of 10
    sub.pop <- matrix(data = c(rep(1,15),rep(2,15),rnorm(15,2,1),rnorm(15,20,1)),
                      nrow = 30, ncol = 2, dimnames = list((1:30),c('trt','resp')))

    # with strata specified
    b.strat <- boot(sub.pop, rel.yield, R = 1000, strata = sub.pop[,1])
    # without strata specified (named b.plain to avoid masking base::c)
    b.plain <- boot(sub.pop, rel.yield, R = 1000)

    # note the distributions of t* are similar
    par(mfrow=c(1,2))
    hist(b.strat$t)
    abline(v=mean(b.strat$t))
    hist(b.plain$t)
    abline(v=mean(b.plain$t))

    # and the CI estimates are also similar
    boot.ci(b.strat)
    boot.ci(b.plain)

Best Answer

Your understanding in 1) and 2) is correct. With the strata option, bootstrap samples are drawn within each stratum independently. Without it, bootstrap samples are drawn from all the data, so a given bootstrap sample will generally not contain the same number of observations from each stratum as the original data.
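
The difference between the two schemes can be seen by counting how many rows of each group land in a single resample. The following is a minimal base-R sketch of the idea, not the internals of boot; the variable names (`idx.plain`, `idx.strat`, and so on) are made up for illustration, and the group layout (15 rows per group) is copied from the question.

```r
set.seed(42)
trt <- rep(1:2, each = 15)   # group labels as in the question
n   <- length(trt)

# Unstratified: draw n row indices from the whole dataset
idx.plain <- sample.int(n, n, replace = TRUE)
n.plain   <- table(factor(trt[idx.plain], levels = 1:2))

# Stratified: resample row indices separately within each group
idx.strat <- unlist(lapply(split(seq_len(n), trt), function(rows) {
  rows[sample.int(length(rows), replace = TRUE)]
}))
n.strat <- table(factor(trt[idx.strat], levels = 1:2))

print(n.plain)  # group sizes vary from resample to resample
print(n.strat)  # always 15 and 15
```

Only the stratified counts are fixed; the unstratified counts sum to 30 but split differently on each draw.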

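As to why they are similar: it is a matter of sample size. With 15 observations per group, the group sizes in an unstratified resample vary around 15, which adds only a little extra variability to the ratio. When running the above example (seed 42 and 10,000 bootstrap replicates), I obtain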

Standard error of stratified bootstrap estimator: 0.8502739
Standard error of estimator: 0.8691592

After reducing the sample size to 10 (from 30), I obtain (with seed 42)

Standard error of stratified bootstrap estimator: 1.032356
Standard error of estimator: 1.131476
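
A comparison of this kind could be run roughly as follows. This is a sketch under assumptions: `se.compare` is a hypothetical helper, the data are regenerated inside it, and "sample size 10" is read as 5 observations per group. Because it is not known where the answer set its seed, the numbers it prints will probably not match those quoted above.

```r
library(boot)

# Statistic from the question: ratio of treatment mean to control mean
rel.yield <- function(D, i) {
  trt  <- D[i, 1]
  resp <- D[i, 2]
  mean(resp[trt == 2]) / mean(resp[trt == 1])
}

# Hypothetical helper: bootstrap standard errors with and without strata
se.compare <- function(n.per.group, R = 10000) {
  D <- cbind(trt  = rep(1:2, each = n.per.group),
             resp = c(rnorm(n.per.group, 2, 1), rnorm(n.per.group, 20, 1)))
  b.strat <- boot(D, rel.yield, R = R, strata = D[, 1])
  b.plain <- boot(D, rel.yield, R = R)
  # na.rm = TRUE: with small samples an unstratified resample can miss
  # a group entirely, which makes the ratio NaN for that replicate
  c(stratified   = sd(b.strat$t, na.rm = TRUE),
    unstratified = sd(b.plain$t, na.rm = TRUE))
}

set.seed(42)
se.30 <- se.compare(15)  # 30 rows in total, as in the question
se.10 <- se.compare(5)   # 10 rows in total (assumed meaning of "10")
print(se.30)
print(se.10)
```

The `na.rm = TRUE` guard matters mostly for the small sample, where an unstratified resample occasionally contains rows from only one group.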

Related Solutions

Solved – Why is the bootstrap function for paired samples t test in R not returning the same result as SPSS

Your bootstrap function is not correct.

I know why all of your p values are between 0.4 and 0.6 and are averaging 0.5: half of your resamples give you a test statistic below and half of your resamples give you a test statistic above the original. You will always get that result from that function - I tried it out with some other data. You aren't randomly switching up the pre and post data.

To get the bootstrap p value, you compare your observed test statistic,

 t.test(x=data[,1], y=data[,2], paired=TRUE)$statistic

with a random shuffling of pre and post data. So, you need to sample from your original data AND randomly mix up pre and post data (maintaining pairs though).

I'll try to post some code later if you still need help.
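
In the meantime, here is one possible reading of the "randomly mix up pre and post while keeping pairs" idea, written as a sign-flip permutation test. It is a sketch, not the answerer's code: it only swaps labels within pairs and skips the additional resampling of pairs mentioned above, and the data (`pre`, `post`) are simulated purely for illustration.

```r
set.seed(1)

# Hypothetical paired data (not from the original question)
pre  <- rnorm(20, mean = 10, sd = 2)
post <- pre + rnorm(20, mean = 0.5, sd = 1)

# Observed paired t statistic
t.obs <- t.test(pre, post, paired = TRUE)$statistic

B      <- 2000
t.null <- numeric(B)
for (b in seq_len(B)) {
  # Swap pre and post within each pair with probability 1/2;
  # the pairing itself is preserved
  swap <- runif(length(pre)) < 0.5
  x <- ifelse(swap, post, pre)
  y <- ifelse(swap, pre, post)
  t.null[b] <- t.test(x, y, paired = TRUE)$statistic
}

# Two-sided p-value: share of shuffled statistics at least as extreme
p.value <- mean(abs(t.null) >= abs(t.obs))
print(p.value)
```

Swapping within pairs builds the reference distribution under the null of no pre/post difference, which is what the observed statistic should be compared against.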

Solved – Bootstrapping a t-test in R

I've never used the boot package. Bootstrapping is so trivial you can just code it from scratch. Below, I just use t.test() with the defaults; you can choose var.equal=T, alternative="greater", etc., if you'd like. I set the seed, so your results would be identical, if you don't do anything different. For the qq-plot for the t-distribution, I used the df that corresponds to equal variances, which won't quite match the bootstrap (where each iteration will have a different effective df). Under the null, p-values should be uniformly distributed, but yours clearly aren't. I'm not sure I'd draw any conclusions from that, though.

library(car)
white_matter <- read.table(text="   Control Patient
1   0.3329  0.3306
2   0.3458  0.3375
3   0.3500  0.3874
4   0.3680  0.3485
5   0.3421  0.3548
6   0.3403  0.3876
7   0.3447  0.3755
8   0.3330  0.3644
9   0.3450  0.3206
10  0.3764  0.3587
11  0.3646  0.3570
12  0.3482  0.3423
13  0.3734  0.3583
14  0.3436  0.3457
15  0.3348  0.3770
16  0.3553  0.3419
17  0.3281  0.3416
18  0.3567  0.3703
19  0.3390  0.3525
20  0.3287  0.3596
21  0.3603  0.3519
22  0.3533  0.3443", header=T)

set.seed(1315)
B      <- 1000
t.vect <- numeric(B)   # bootstrap t statistics
p.vect <- numeric(B)   # bootstrap p-values
for(i in 1:B){
  # resample each group independently, with replacement
  boot.c <- sample(white_matter$Control, size=22, replace=TRUE)
  boot.p <- sample(white_matter$Patient, size=22, replace=TRUE)
  ttest  <- t.test(boot.c, boot.p)
  t.vect[i] <- ttest$statistic
  p.vect[i] <- ttest$p.value
}

dev.new()  # portable replacement for windows()
  qqPlot(t.vect, distribution="t", df=42)


dev.new()  # portable replacement for windows()
  qqPlot(p.vect, distribution="unif")

