Hypothesis Testing – Two Methods of Using Bootstrapping to Test the Difference Between Two Sample Means

bootstrap, hypothesis-testing, t-test

I'd like to run a bootstrap hypothesis test (a two-sample Student's t-test).
In Efron and Tibshirani (1993, p. 224) there is explicit code for this: for each observation, subtract its group mean and add the overall mean, where the overall mean is the mean of the two samples combined. They argue that the bootstrap distribution of the test statistic should be generated under the null hypothesis, and that is why the observations have to be re-centered first.
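Concretely, their re-centering amounts to this (a minimal sketch with made-up example data; after the shift, both samples have the grand mean as their sample mean, so the null hypothesis of equal means holds in the bootstrap population):

x <- rnorm(25, 0, 10); y <- rnorm(25, 5, 10)  # example data
grand_mean <- mean(c(x, y))                   # mean of the combined samples
x_null <- x - mean(x) + grand_mean            # shift each sample so that
y_null <- y - mean(y) + grand_mean            # mean(x_null) == mean(y_null) == grand_mean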

However, I have also learnt that it is possible to bootstrap from the samples directly, without modifying them. I tried both methods: Efron's steps (in the function boot_t_F below) and resampling without transforming the observations (in the function boot_t_B).

I expected the resulting bootstrapped p-values (the proportion of bootstrapped test statistics that exceed the original test statistic) to be essentially the same, but they are not.

Why is this?

My two functions are below:

boot_t_B <- function(x, y){
  print(t.test(x, y, var.equal=TRUE))                  # original test
  t.est <- abs(t.test(x, y, var.equal=TRUE)$statistic) # observed Student's t-statistic

  grand_mean <- mean(c(x, y), na.rm=TRUE)  # grand mean of the combined samples
  x1 <- x  # - mean(x, na.rm=TRUE) + grand_mean  <- deliberately NOT subtracted/added here
  y1 <- y  # - mean(y, na.rm=TRUE) + grand_mean

  B      <- 10000             # number of bootstrap samples
  t.vect <- vector(length=B)  # vector for the bootstrapped t-statistics
  for(i in 1:B){
    boot.c <- sample(x1, size=length(x), replace=TRUE)
    boot.p <- sample(y1, size=length(y), replace=TRUE)
    t.vect[i] <- t.test(boot.c, boot.p, var.equal=TRUE)$statistic
  }
  return(mean(t.vect > t.est))  # bootstrapped p-value
}


boot_t_F <- function(x, y){
  print(t.test(x, y, var.equal=TRUE))                  # original test
  t.est <- abs(t.test(x, y, var.equal=TRUE)$statistic) # observed Student's t-statistic

  grand_mean <- mean(c(x, y), na.rm=TRUE)     # grand mean of the combined samples
  x1 <- x - mean(x, na.rm=TRUE) + grand_mean  # re-centre each sample at the grand mean
  y1 <- y - mean(y, na.rm=TRUE) + grand_mean  # so the null hypothesis of equal means holds

  B      <- 10000             # number of bootstrap samples
  t.vect <- vector(length=B)  # vector for the bootstrapped t-statistics
  for(i in 1:B){
    boot.c <- sample(x1, size=length(x), replace=TRUE)
    boot.p <- sample(y1, size=length(y), replace=TRUE)
    t.vect[i] <- t.test(boot.c, boot.p, var.equal=TRUE)$statistic
  }
  return(mean(t.vect > t.est))  # bootstrapped p-value
}

> set.seed(1678)
> boot_t_B(rnorm(25,0,10), rnorm(25,5,10))
[1] 4e-04
> set.seed(1678)
> boot_t_F(rnorm(25,0,10), rnorm(25,5,10))
[1] 0.0507

Note: I chose the (normal) distributions of the samples arbitrarily.

Best Answer

The issue is that the bootstrap in boot_t_B isn't done correctly. If you don't shift the means to be equal (i.e., force the null hypothesis to be true by re-centering each sample, as in boot_t_F), you have to force the null hypothesis to be true another way: by drawing both bootstrap samples from the two samples combined:

boot.c <- sample(c(x1,y1), size=length(x), replace=TRUE)
boot.p <- sample(c(x1,y1), size=length(y), replace=TRUE)

The reason is that if the means ARE different, then in your original formulation boot.c and boot.p are actually samples from the alternative hypothesis, with the alternative distributions "centered" at the observed data. You can think of it as bootstrap sampling from the alternative distribution that is most likely given the data, except that you are being nonparametric rather than using a parametric bootstrap. Consequently, you don't get p-values, which are of course calculated assuming the null hypothesis is true.
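For completeness, here is a sketch of a corrected version of your function (boot_t_P is just a name I'm making up; it is your boot_t_B with the pooled resampling substituted in):

boot_t_P <- function(x, y){
  t.est  <- abs(t.test(x, y, var.equal=TRUE)$statistic) # observed test statistic
  pooled <- c(x, y)           # combined sample: the estimated null distribution
  B      <- 10000             # number of bootstrap samples
  t.vect <- vector(length=B)  # vector for the bootstrapped t-statistics
  for(i in 1:B){
    # both bootstrap samples come from the SAME pooled distribution,
    # so the null hypothesis of equal means is true by construction
    boot.c <- sample(pooled, size=length(x), replace=TRUE)
    boot.p <- sample(pooled, size=length(y), replace=TRUE)
    t.vect[i] <- t.test(boot.c, boot.p, var.equal=TRUE)$statistic
  }
  return(mean(t.vect > t.est))  # bootstrapped p-value, as in your functions
}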

If you do it this way, you get:

> set.seed(1678)
> boot_t_B(rnorm(25,0,10), rnorm(25,5,10))
[1] 0.05
> set.seed(1678)
> boot_t_F(rnorm(25,0,10), rnorm(25,5,10))
[1] 0.0507
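One side note: mean(t.vect > t.est) only counts bootstrapped statistics that exceed the absolute observed statistic in the positive direction. If you want a strictly two-sided p-value, a common variant (my suggestion, not part of your original code) compares absolute values:

mean(abs(t.vect) >= t.est)  # two-sided: count exceedances in both tails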