Here is my take on it, based on chapter 16 of Efron and Tibshirani's An Introduction to the Bootstrap (pages 220–224). The short of it is that your second bootstrap algorithm was implemented incorrectly, but the general idea is correct.
When conducting bootstrap tests, one has to make sure that the resampling method generates data that conform to the null hypothesis. I'll use the sleep data in R to illustrate this post. Note that I am using the studentized test statistic rather than just the difference of means, as the textbook recommends.
The classical t-test, which uses an analytical result to obtain information about the sampling distribution of the t-statistic, yields the following result:
x <- sleep$extra[sleep$group==1]
y <- sleep$extra[sleep$group==2]
t.test(x,y)
t = -1.8608, df = 17.776, p-value = 0.07939
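For reference, the studentized statistic that `t.test` computes by default is Welch's two-sample t, which does not assume equal variances:

$$ t = \frac{\bar{x} - \bar{y}}{\sqrt{s_x^2/n_1 + s_y^2/n_2}} $$

where $s_x^2$ and $s_y^2$ are the sample variances of the two groups; the fractional degrees of freedom (17.776 above) come from the Welch–Satterthwaite approximation.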
One approach is similar in spirit to the more well-known permutation test: samples are taken across the entire set of observations whilst ignoring the grouping labels. Then the first $n_1$ resampled observations are assigned to the first group and the remaining $n_2$ to the second group.
# pooled sample, assumes equal variance
pooled <- c(x, y)
boot.t <- numeric(10000)
for (i in 1:10000){
  sample.index <- sample(length(pooled), replace = TRUE)
  sample.x <- pooled[sample.index][1:length(x)]
  sample.y <- pooled[sample.index][-(1:length(x))]  # drop the first n1 elements
  boot.t[i] <- t.test(sample.x, sample.y)$statistic
}
Note two fixes to the code: boot.t must be initialised before the loop, and the second group must exclude the first length(x) elements, not length(y). The original -c(1:length(y)) only works here by accident because both groups happen to have the same size.
p.pooled <- (1 + sum(abs(boot.t) >= abs(t.test(x,y)$statistic))) / (10000+1)
p.pooled
[1] 0.07929207
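For comparison, the classical permutation test mentioned above reshuffles the group labels without replacement instead of resampling with replacement. A minimal sketch (variable names perm.t, perm.index and p.perm are mine):

```r
# permutation test: reshuffle labels without replacement
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]
pooled <- c(x, y)
perm.t <- numeric(10000)
set.seed(1)  # for reproducibility
for (i in 1:10000) {
  perm.index <- sample(length(pooled))            # a random permutation
  perm.x <- pooled[perm.index[1:length(x)]]
  perm.y <- pooled[perm.index[-(1:length(x))]]
  perm.t[i] <- t.test(perm.x, perm.y)$statistic
}
p.perm <- (1 + sum(abs(perm.t) >= abs(t.test(x, y)$statistic))) / (10000 + 1)
p.perm
```

Like the pooled bootstrap, this tests the stronger null hypothesis that the two distributions are identical, not merely that the means are equal.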
However, this algorithm is actually testing whether the distributions of x and y are identical. If we are simply interested in whether or not their population means are equal, without making any assumptions about their variances, we should generate data under $H_0$ in a slightly different manner. You were on the right track with your approach, but your translation to $H_0$ is a bit different from the one proposed in the textbook. To generate data under $H_0$, we subtract each group's mean from its observations and then add back the common or pooled mean $\bar{z}$:
$$ \tilde{x}_i = x_i - \bar{x} + \bar{z} $$
$$ \tilde{y}_i = y_i - \bar{y} + \bar{z}$$
This becomes more intuitive when you calculate the means of the new variables $\tilde{x}$ and $\tilde{y}$. By first subtracting their respective group means, the variables become centred around zero. By adding the overall mean $\bar{z}$, we end up with samples of observations centred around the overall mean. In other words, we transformed the observations so that they have the same mean, which is also the overall mean of both groups together, which is exactly $H_0$.
# sample from H0 separately, no assumption about equal variance
xt <- x - mean(x) + mean(sleep$extra)
yt <- y - mean(y) + mean(sleep$extra)
boot.t <- numeric(10000)
for (i in 1:10000){
sample.x <- sample(xt,replace=TRUE)
sample.y <- sample(yt,replace=TRUE)
boot.t[i] <- t.test(sample.x,sample.y)$statistic
}
p.h0 <- (1 + sum(abs(boot.t) >= abs(t.test(x,y)$statistic))) / (10000+1)
p.h0
[1] 0.08049195
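As a quick sanity check, the shifted samples xt and yt from the code above indeed share the pooled mean, confirming that the resampled data are generated under $H_0$:

```r
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]
xt <- x - mean(x) + mean(sleep$extra)
yt <- y - mean(y) + mean(sleep$extra)
mean(xt)                        # equals mean(sleep$extra)
mean(yt)                        # equals mean(sleep$extra)
all.equal(mean(xt), mean(yt))   # TRUE
```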
This time around we ended up with similar p-values (roughly 0.08) across all three approaches: the classical t-test, the pooled bootstrap, and the bootstrap under $H_0$.