Suppose that I have a data with two independent groups:

```
g1.lengths <- c (112.64, 97.10, 84.18, 106.96, 98.42, 101.66)
g2.lengths <- c (84.44, 82.10, 83.26, 81.02, 81.86, 86.80,
85.84, 97.08, 79.64, 83.32, 91.04, 85.92,
73.52, 85.58, 97.70, 89.72, 88.92, 103.72,
105.02, 99.48, 89.50, 81.74)
group = rep (c ("g1", "g2"), c (length (g1.lengths), length (g2.lengths)))
lengths = data.frame( lengths = c(g1.lengths, g2.lengths), group)
```

It is evident that sample size per group is biased where **g1 has 6** observations and **g2 has 22**. Traditional ANOVA suggests that groups have different means when critical value is set to 0.05 (p value is **0.0044**).

```
summary (aov (lengths~group, data = lengths))
```

Given that my aim is to compare mean difference, such unbalanced and small sampled data might give inappropriate results with traditional approach. Therefore, I want to perform permutation test and bootstrap.

**PERMUTATION TEST**

Null hypothesis (H0) states that group's means are the same. This assumption in permutation test is justify by pooling groups into one sample. This ensures that the samples for two groups were drawn from the identical distribution. By repeated sampling (or more precisely – reshuffling) from the pooled data, the observations are reallocated (shuffled) to the samples in new way, and test statistic is calculated. Performing this n times, will give sampling distribution of the test statistics under the assumption where H0 is TRUE. At the end, under the H0, p value is probability that the test statistic equals or exceeds the observed value.

```
s.size.g1 <- length (g1.lengths)
s.size.g2 <- length (g2.lengths)
pool <- lengths$lengths
obs.diff.p <- mean (g1.lengths) - mean (g2.lengths)
iterations <- 10000
sampl.dist.p <- NULL
set.seed (5)
for (i in 1 : iterations) {
resample <- sample (c(1:length (pool)), length(pool))
g1.perm = pool[resample][1 : s.size.g1]
g2.perm = pool[resample][(s.size.g1+1) : length(pool)]
sampl.dist.p[i] = mean (g1.perm) - mean (g2.perm)
}
p.permute <- (sum (abs (sampl.dist.p) >= abs(obs.diff.p)) + 1)/ (iterations+1)
```

Reported p value of permutation test is **0.0053**. OK, if I did it correctly, permutations and parametric ANOVA give nearly identical results.

**BOOTSTRAP**

First of all, I am aware that bootstrap can not help when sample sizes is too small. This post showed that it can be even worse and misleading. Also, second one highlighted that permutation test is generally better than bootstrap when the hypothesis testing is the main aim. Nonetheless, this great post addresses important differences among computer-intensive methods. However, here I want to raise (I believe) a different question.

Let me introduce most common bootstrap approach first (Bootstrap1: **resampling within the pooled sample**):

```
s.size.g1 <- length (g1.lengths)
s.size.g2 <- length (g2.lengths)
pool <- lengths$lengths
obs.diff.b1 <- mean (g1.lengths) - mean (g2.lengths)
iterations <- 10000
sampl.dist.b1 <- NULL
set.seed (5)
for (i in 1 : iterations) {
resample <- sample (c(1:length (pool)), length(pool), replace = TRUE)
# "replace = TRUE" is the only difference between bootstrap and permutations
g1.perm = pool[resample][1 : s.size.g1]
g2.perm = pool[resample][(s.size.g1+1) : length(pool)]
sampl.dist.b1[i] = mean (g1.perm) - mean (g2.perm)
}
p.boot1 <- (sum (abs (sampl.dist.b1) >= obs.diff.b1) + 1)/ (iterations+1)
```

P value of bootstrap performed in this way is **0.005**. Even that this sounds reasonable and almost identical to parametric ANOVA and permutation test, is it appropriate to justify H0 in this bootstrap on the basis that we **just pooled samples** from which we drawn subsequent samples?

Different approach I found in several scientific papers. Specifically, I saw that researchers modify the data in order to meet H0 prior to bootstrap. Searching around, I have found very interesting post in CV where @jan.s explained unusual results of bootstrap in the post question where the aim was to compare two means. However, in that post it is not covered how to perform bootstrap when data are modified prior to bootstrap. Approach where the data is modified prior to bootstrap looks like this:

- H0 states that the means of two groups are the same
- H0 holds true if we subtract individual observations from the mean of pooled sample

In this case, modification of the data should affects the groups means, and therefore its difference, but not variation within (and between) groups.

- Modified data will be basis for further bootstrap, with caveats that sampling is carried out
.*within each group separately* - Difference between bootstrapped mean of g1 and g2 is calculated and compared with observed (non-modified) difference between groups.
- Proportion of equal or more extreme values than observed one divided by number of iteration will give p value.

Here is the code (Bootstrap2: **resampling within the groups after modification that H0 is TRUE**):

```
s.size.g1 <- length (g1.lengths)
s.size.g2 <- length (g2.lengths)
pool <- lengths$lengths
obs.diff.b2 <- mean (g1.lengths) - mean (g2.lengths)
# make H0 to be true (no difference between means of two groups)
H0 <- pool - mean (pool)
# g1 from H0
g1.H0 <- H0[1:s.size.g1]
# g2 from H0
g2.H0 <- H0[(s.size.g1+1):length(pool)]
iterations <- 10000
sampl.dist.b2 <- NULL
set.seed (5)
for (i in 1 : iterations) {
# Sample with replacement in g1
g1.boot = sample (g1.H0, replace = T)
# Sample with replacement in g2
g2.boot = sample (g2.H0, replace = T)
# bootstrapped difference
sampl.dist.b2[i] <- mean (g1.boot) - mean (g2.boot)
}
p.boot2 <- (sum (abs (sampl.dist.b2) >= obs.diff.b2) + 1)/ (iterations+1)
```

Such performed bootstrap will give p value of **0.514** which is tremendously different compared to previous tests. I believe that this has to deal with @jan.s' explanation, but I can not figure out where is the key…

## Best Answer

Here is my take on it, based on chapter 16 of Efron's and Tibshirani's

An Introduction to the bootstrap(page 220-224). The short of it is that your second bootstrap algorithm was implemented wrongly, but the general idea is correct.When conducting bootstrap tests, one has to make sure that the re-sampling method generates data that corresponds to the null hypothesis. I'll use the sleep data in R to illustrate this post. Note that I am using the studentized test statistic rather than just the difference of means, which is recommended by the textbook.

The classical t-test, which uses an analytical result to obtain information about the sampling distribution of the t-statistic, yields the following result:

One approach is similar in spirit to the more well-known permutation test: samples are taken across the entire set of observations whilst ignoring the grouping labels. Then the first $n1$ are assigned to the first group and the remaining $n2$ to the second group.

However, this algorithm is actually testing whether the distribution of x and y are identical. If we are simply interested in whether or not their population means are equal, without making any assumptions about their variance, we should generate data under $H_0$ in a slightly different manner. You were on the right track with your approach, but your translation to $H_0$ is a bit different from the one proposed in the textbook. To generate $H_0$ we need to subtract the first group's mean from the observations in the first group and then add the common or pooled mean $\bar{z}$. For the second group we do the same thing.

$$ \tilde{x}_i = x_i - \bar{x} + \bar{z} $$ $$ \tilde{y}_i = y_i - \bar{y} + \bar{z}$$

This becomes more intuitive when you calculate the means of the new variables $\tilde{x}/\tilde{y}$. By first subtracting their respective group means, the variables become centred around zero. By adding the overall mean $\bar{z}$ we end up with a sample of observations centred around the overall mean. In other words, we transformed the observations so that they have the same mean, which is also the overall mean of both groups together, which is exactly $H_0$.

This time around we ended up with similar p-values for the three approaches.