Here is my take on it, based on Chapter 16 of Efron and Tibshirani's An Introduction to the Bootstrap (pages 220-224). The short of it is that your second bootstrap algorithm was implemented incorrectly, but the general idea is correct.
When conducting bootstrap tests, one has to make sure that the resampling method generates data consistent with the null hypothesis. I'll use the sleep data in R to illustrate this post. Note that I am using the studentized test statistic rather than just the difference of means, as recommended by the textbook.
The classical t-test, which uses an analytical result to obtain information about the sampling distribution of the t-statistic, yields the following result:
x <- sleep$extra[sleep$group==1]
y <- sleep$extra[sleep$group==2]
t.test(x,y)
t = -1.8608, df = 17.776, p-value = 0.07939
One approach is similar in spirit to the more well-known permutation test: samples are taken from the entire set of observations while ignoring the grouping labels. Then the first $n_1$ observations are assigned to the first group and the remaining $n_2$ to the second group.
# pooled sample, assumes equal variance
pooled <- c(x, y)
boot.t <- numeric(10000)
for (i in 1:10000){
  sample.index <- sample(1:length(pooled), replace = TRUE)
  sample.x <- pooled[sample.index][1:length(x)]
  sample.y <- pooled[sample.index][-(1:length(x))]
  boot.t[i] <- t.test(sample.x, sample.y)$statistic
}
p.pooled <- (1 + sum(abs(boot.t) >= abs(t.test(x,y)$statistic))) / (10000+1)
p.pooled
[1] 0.07929207
However, this algorithm actually tests whether the distributions of x and y are identical. If we are simply interested in whether or not their population means are equal, without making any assumptions about their variances, we should generate data under $H_0$ in a slightly different manner. You were on the right track with your approach, but your translation of $H_0$ differs a bit from the one proposed in the textbook. To generate data under $H_0$, we subtract the first group's mean from the observations in the first group and then add the common, or pooled, mean $\bar{z}$. For the second group we do the same thing.
$$ \tilde{x}_i = x_i - \bar{x} + \bar{z} $$
$$ \tilde{y}_i = y_i - \bar{y} + \bar{z}$$
This becomes more intuitive when you calculate the means of the new variables $\tilde{x}$ and $\tilde{y}$. Subtracting the respective group means centres the observations around zero; adding the overall mean $\bar{z}$ then centres each sample around the overall mean. In other words, we have transformed the observations so that both groups share the same mean, namely the overall mean of both groups together, which is exactly $H_0$.
# sample from H0 separately, no assumption about equal variance
xt <- x - mean(x) + mean(sleep$extra)
yt <- y - mean(y) + mean(sleep$extra)
boot.t <- numeric(10000)
for (i in 1:10000){
  sample.x <- sample(xt, replace = TRUE)
  sample.y <- sample(yt, replace = TRUE)
  boot.t[i] <- t.test(sample.x, sample.y)$statistic
}
p.h0 <- (1 + sum(abs(boot.t) >= abs(t.test(x,y)$statistic))) / (10000+1)
p.h0
[1] 0.08049195
This time around, we ended up with similar p-values for all three approaches.
Whether you take the combinations or the permutations does not actually affect your results: for every combination that splits the pooled sample into $n_{A}$ specific objects in $A$ and $n_{B}$ specific objects in $B$, the number of permutations producing that same split is identical for all combinations of $x_{1} \dots x_{n_{A}}$ and $x_{n_{A}+1} \dots x_{n_{A} + n_{B}}$, since the size of each set doesn't change.
That is, for any given combination, you will get $n_{A}! \times n_{B}!$ times as many permutations as combinations, regardless of the values inside each set. And since the value of the result (the difference between group means) does not change between permutations of the same combination, the frequency of each specific result is scaled equally when taking permutations. So, when calculating the quantiles, it practically makes no difference whether you use combinations or permutations. In fact, you empirically proved this for the case of $n_{A} = 1$ and $n_{B} = 2$: the frequency of each result, $D = 0, 2, 4$, is simply scaled by $2$ when taking permutations, so the quantile values are the same.
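The scaling argument can be checked directly. The sketch below uses hypothetical data (not the values from the original question) with $n_{A} = 1$ and $n_{B} = 2$: each difference of means appears once over the combinations and $n_{A}! \times n_{B}! = 2$ times over the full set of permutations, so the two empirical distributions coincide.

```r
# hypothetical pooled sample with n_A = 1, n_B = 2
z <- c(1, 3, 5)
n.A <- 1

# all combinations: which single element goes into group A
comb.d <- sapply(1:3, function(i) z[i] - mean(z[-i]))

# all 3! permutations of the pooled sample; the first n.A positions form group A
perms <- rbind(c(1,2,3), c(1,3,2), c(2,1,3),
               c(2,3,1), c(3,1,2), c(3,2,1))
perm.d <- apply(perms, 1, function(p)
  mean(z[p][1:n.A]) - mean(z[p][-(1:n.A)]))

table(comb.d)  # each difference appears once
table(perm.d)  # each difference appears n.B! = 2 times

# the empirical distribution functions are identical
grid <- c(-3, 0, 3)
all(ecdf(comb.d)(grid) == ecdf(perm.d)(grid))
```

Because every frequency is multiplied by the same constant, any quantile computed from the empirical distribution is unchanged.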
Let's assume the standard scenario where the samples are independent and we want to test whether the two samples come from the same distribution (the null hypothesis) based on the difference in sample means.
To be technical, if you want to test this specific hypothesis, I think it is more strictly "correct" to take the complete set of permutations (not combinations) of each set, since the distributional assumption under the null, that group labels don't matter, essentially allows each $x_{i}$ to take every value in the presence of every other $x_{j \neq i}$, which combinations do not allow for.
But again, the quantiles of the empirical distribution are the same, since the frequency of each result is scaled by the same factor $n_{A}! \times n_{B}!$, so in practice it doesn't matter.
Often there are several statistics that will all result in the same p-value. For example, in a two-sample case, the difference of the two means, the mean of group A, and the sum of the values in group A will all result in the same p-value (this is because, given the data values and sample sizes, you can calculate the first two from the third alone). I would expect the t-statistic to behave similarly to any of the above, but it may not be exactly the same (due to the division by the standard deviation(s)). Other statistics could give very different results, for example the difference of the two medians or the ratio of the two variances. These other statistics are affected differently by the permutation process.
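To see the equivalence concretely, here is a sketch with hypothetical integer-valued data (not from the original question). Under permutation of a fixed pooled sample, the difference of means, the mean of group A, and the sum of group A are increasing functions of one another, so their one-sided permutation p-values coincide exactly when the same set of permutations is reused for each statistic.

```r
# hypothetical data; integers keep all sums exact
A <- c(5, 6, 8, 9, 7)
B <- c(1, 3, 2, 4, 0)
pooled <- c(A, B)
nA <- length(A)

set.seed(1)
idx <- replicate(2000, sample(length(pooled)))  # reuse the same permutations

p.perm <- function(stat) {
  obs <- stat(A, B)
  perm <- apply(idx, 2, function(p)
    stat(pooled[p][1:nA], pooled[p][-(1:nA)]))
  mean(perm >= obs)  # one-sided (upper-tail) p-value
}

p1 <- p.perm(function(a, b) mean(a) - mean(b))
p2 <- p.perm(function(a, b) mean(a))
p3 <- p.perm(function(a, b) sum(a))
c(p1, p2, p3)  # all three are identical
```

Swapping in `median(a) - median(b)` or `var(a) / var(b)` as the statistic would generally give a different p-value, which is the point made above.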
Your choice should be based on a combination of what is most interesting given the science and the question being asked (sometimes medians are of more interest, other times means) and what will give you power to detect a difference under reasonable/meaningful alternatives. You can check this later by simulating data from cases that you think likely or interesting and watching how the statistics perform.
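One way to carry out that check is a small power simulation. The sketch below uses entirely hypothetical settings (normal data, a mean shift of 1, n = 15 per group, 5% level): it repeatedly simulates data under the alternative, runs a two-sided permutation test with the chosen statistic, and records the rejection rate.

```r
# estimate the power of a permutation test for a given statistic
# (all settings here are illustrative assumptions)
power.sim <- function(stat, n.sim = 200, n = 15, shift = 1, n.perm = 200) {
  rejections <- replicate(n.sim, {
    a <- rnorm(n, mean = shift)  # alternative: group means differ by `shift`
    b <- rnorm(n)
    pooled <- c(a, b)
    obs <- stat(a, b)
    perm <- replicate(n.perm, {
      p <- sample(2 * n)
      stat(pooled[p][1:n], pooled[p][-(1:n)])
    })
    mean(abs(perm) >= abs(obs)) < 0.05  # two-sided rejection at the 5% level
  })
  mean(rejections)  # estimated power
}

set.seed(42)
power.sim(function(a, b) mean(a) - mean(b))      # difference of means
power.sim(function(a, b) median(a) - median(b))  # difference of medians
```

Comparing the two estimates shows which statistic has more power under this particular alternative; repeating the exercise with heavier-tailed simulated data would typically favour the medians.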