This is the bootstrap analogy principle. The (unknown) underlying true distribution $F$ produced the sample at hand $x_1, \ldots, x_n$ with empirical cdf $F_n$, which in turn produced the statistic $\hat\theta=T(F_n)$ for some functional $T(\cdot)$. Your idea in using the bootstrap is to make statements about the sampling distribution based on a known distribution $\tilde F$, where you try to use an identical sampling protocol (which is exactly possible only for i.i.d. data; dependent data always lead to limitations in how accurately one can reproduce the sampling process) and apply the same functional $T(\cdot)$. I demonstrated it in another post with (what I think is) a neat diagram. So the bootstrap analogue of the (sampling + systematic) deviation $\hat\theta - \theta_0$, the quantity of your central interest, is the deviation of the bootstrap replicate $\hat\theta^*$ from what is known to be true for the distribution $\tilde F$, the sampling process you applied, and the functional $T(\cdot)$, i.e., your measure of central tendency is $T(\tilde F)$. If you used the standard nonparametric bootstrap with replacement from the original data, your $\tilde F=F_n$, so your measure of central tendency has to be $T(F_n) \equiv \hat\theta$ based on the original data.
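For the sample mean this analogy can be written down in a few lines of R. This is only a minimal sketch, with a simulated sample standing in for data from the unknown $F$:

# minimal sketch of the analogy for T = the mean functional;
# the bootstrap deviations are taken around T(F_n) = mean(w), not around theta_0
set.seed(42)
w <- rgamma(50, shape = 2)     # stand-in sample from the unknown F
theta.hat <- mean(w)           # T(F_n), the centre of the bootstrap world
boot.dev <- replicate(5000, mean(sample(w, replace = TRUE)) - theta.hat)
quantile(boot.dev, c(0.025, 0.975))   # approximates quantiles of theta.hat - theta_0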
Besides the translation, there are subtler issues with bootstrap tests that are sometimes difficult to overcome. The distribution of a test statistic under the null may be drastically different from its distribution under the alternative (e.g., in tests on the boundary of the parameter space, which fail with the bootstrap). The simple tests you learn in undergraduate classes, such as the $t$-test, are invariant under shift, but the thinking "Heck, I just shift everything" fails once you move to the next level of conceptual complexity, the asymptotic $\chi^2$ tests. Think about this: you are testing $\mu=0$, and your observed $\bar x=0.78$. When you construct a $\chi^2$ test $(\bar x-\mu)^2/(s^2/n) \equiv \bar x^2/(s^2/n)$ with the bootstrap analogue $\bar x_*^2/(s_*^2/n)$, this test has a built-in non-centrality of $n \bar x^2/s^2$ from the outset, instead of being a central test as we would expect. To make the bootstrap test central, you really have to subtract the original estimate.
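A small simulation makes that non-centrality visible. This is only a sketch with made-up one-sample data (mean near 0.78), not anything from your setting:

# sketch: naive bootstrap analogue vs. the recentred one (simulated data)
set.seed(1)
n <- 100
v <- rnorm(n, mean = 0.78, sd = 1)
vbar <- mean(v)
boot.naive <- numeric(2000)
boot.centred <- numeric(2000)
for (b in 1:2000){
  vs <- sample(v, replace = TRUE)
  boot.naive[b] <- mean(vs)^2 / (var(vs)/n)             # bar x_*^2 / (s_*^2 / n)
  boot.centred[b] <- (mean(vs) - vbar)^2 / (var(vs)/n)  # subtract the original estimate
}
mean(boot.naive)    # roughly 1 + n*vbar^2/var(v): the built-in non-centrality
mean(boot.centred)  # roughly 1, as for a central chi-square with 1 df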
The $\chi^2$ tests are unavoidable in multivariate contexts, ranging from Pearson $\chi^2$ for contingency tables to Bollen-Stine bootstrap of the test statistic in structural equation models. The concept of shifting the distribution is extremely difficult to define well in these situations... although in case of the tests on the multivariate covariance matrices, this is doable by an appropriate rotation.
Here is my take on it, based on chapter 16 of Efron and Tibshirani's An Introduction to the Bootstrap (pages 220-224). The short of it is that your second bootstrap algorithm was implemented wrongly, but the general idea is correct.
When conducting bootstrap tests, one has to make sure that the re-sampling method generates data that correspond to the null hypothesis. I'll use the sleep data in R to illustrate this post. Note that, as recommended by the textbook, I am using the studentized test statistic rather than just the difference of means.
The classical t-test, which uses an analytical result to obtain information about the sampling distribution of the t-statistic, yields the following result:
x <- sleep$extra[sleep$group==1]
y <- sleep$extra[sleep$group==2]
t.test(x,y)
t = -1.8608, df = 17.776, p-value = 0.07939
One approach is similar in spirit to the better-known permutation test: we resample from the entire pooled set of observations while ignoring the group labels, then assign the first $n_1$ draws to the first group and the remaining $n_2$ to the second group.
# pooled sample, assumes equal variance
pooled <- c(x,y)
boot.t <- numeric(10000)
for (i in 1:10000){
  # resample n1 + n2 observations with replacement from the pooled data
  sample.index <- sample(c(1:length(pooled)), replace=TRUE)
  # assign the first n1 to group 1 and the remaining n2 to group 2
  sample.x <- pooled[sample.index][1:length(x)]
  sample.y <- pooled[sample.index][-c(1:length(x))]
  boot.t[i] <- t.test(sample.x, sample.y)$statistic
}
p.pooled <- (1 + sum(abs(boot.t) >= abs(t.test(x,y)$statistic))) / (10000+1)
p.pooled
[1] 0.07929207
However, this algorithm is actually testing whether the distributions of x and y are identical. If we are simply interested in whether or not their population means are equal, without making any assumptions about their variances, we should generate data under $H_0$ in a slightly different manner. You were on the right track with your approach, but your translation to $H_0$ is a bit different from the one proposed in the textbook. To generate data under $H_0$ we need to subtract the first group's mean from the observations in the first group and then add the common or pooled mean $\bar{z}$. For the second group we do the same thing.
$$ \tilde{x}_i = x_i - \bar{x} + \bar{z} $$
$$ \tilde{y}_i = y_i - \bar{y} + \bar{z}$$
This becomes more intuitive when you calculate the means of the new variables $\tilde{x}$ and $\tilde{y}$. By first subtracting their respective group means, the variables become centred around zero. By adding the overall mean $\bar{z}$ we end up with observations centred around the overall mean. In other words, we transformed the observations so that both groups share the same mean, namely the overall mean of the two groups combined, which is exactly the situation described by $H_0$.
# sample from H0 separately, no assumption about equal variance
xt <- x - mean(x) + mean(sleep$extra)
yt <- y - mean(y) + mean(sleep$extra)
boot.t <- numeric(10000)
for (i in 1:10000){
  # resample each transformed group separately, preserving its own variance
  sample.x <- sample(xt, replace=TRUE)
  sample.y <- sample(yt, replace=TRUE)
  boot.t[i] <- t.test(sample.x, sample.y)$statistic
}
p.h0 <- (1 + sum(abs(boot.t) >= abs(t.test(x,y)$statistic))) / (10000+1)
p.h0
[1] 0.08049195
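As a quick check on the recentring argument above, the two transformed samples indeed share the overall mean:

# both transformed samples are centred at the overall mean of sleep$extra
mean(xt)
mean(yt)
mean(sleep$extra)

All three values coincide, which is exactly the configuration described by $H_0$.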
This time around we ended up with similar p-values for the three approaches.
Best Answer
The idea is to emulate the sampling distribution under the null hypothesis (from which you get an approximate p-value).
So you make a sample that is shaped like the one you have but with a mean equal to the hypothesized one, and you see how unusual your sample is relative to that (this is suitable for a shift alternative). This emulates, as nearly as we can, how the null sampling distribution would behave.
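For a one-sample mean test, that recipe can be sketched in a few lines of R. This is only a minimal sketch (z and mu0 are placeholders for your data and the hypothesized mean, and studentizing the statistic would be a further refinement):

# sketch of a one-sample bootstrap test of H0: mu = mu0 against a shift alternative
boot.mean.test <- function(z, mu0, B = 10000){
  z0 <- z - mean(z) + mu0                           # reshape the sample so that H0 holds
  stars <- replicate(B, mean(sample(z0, replace = TRUE)))
  # two-sided p-value: how unusual is the observed mean under the emulated null?
  (1 + sum(abs(stars - mu0) >= abs(mean(z) - mu0))) / (B + 1)
}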
By doing it the other way around, as you suggest, you end up answering a quite different question from the one you're testing.
Imagine, for example, that we look at a sample from a right-skewed distribution.
Let us further consider that we are interested in testing whether the population mean is 100 against the one-tailed alternative that it is greater than 100, and we obtain a sample of observations with a long right tail.
The sample mean is 100.265
Here are the two comparisons you'd be making under the two resampling schemes:
Under the correct scheme (the sampling distribution stays as it was, but recentred at the hypothesized mean; how unusual is the sample relative to it?), we look at an upper-tail area; the upper tail is heavy, so this gives a relatively high p-value.
Under your proposed scheme we have to look in the left tail instead, and that tail is short, which makes the hypothesized value look inconsistent. This gives a lower p-value because we end up looking in the short tail, when it was the long right tail under the null that "produced" the high sample mean.
When the sample mean is above the hypothesized mean it should tend to be far above, and when it is below it should tend to be close by; we have switched those two around and so miscalculate.
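A rough numerical illustration of that switch, using simulated right-skewed data (not the sample from the example above; the exact p-values will vary with the data):

# compare the two schemes on a simulated long-right-tailed sample
set.seed(123)
z <- 90 + rexp(30, rate = 1/12)   # hypothetical skewed sample
mu0 <- 100                        # H0: mu = 100, H1: mu > 100
B <- 10000

# correct scheme: impose the null mean on the sample, look at the upper tail
z0 <- z - mean(z) + mu0
means.H0 <- replicate(B, mean(sample(z0, replace = TRUE)))
p.correct <- mean(means.H0 >= mean(z))

# reversed scheme: resample the data as-is and ask where mu0 falls (lower tail)
means.obs <- replicate(B, mean(sample(z, replace = TRUE)))
p.reversed <- mean(means.obs <= mu0)

c(p.correct, p.reversed)   # with a long right tail these generally disagree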
For a simple two-sided test where we look in both tails, which way around we do it should make no difference, but the discrepancy in a case as simple as the one above makes it clear that the two approaches won't correspond in general. We should do it the right way around rather than relying on it working out in some cases.