Bootstrap Under H0 – Perform a Test for Difference of Two Means with Replacement within Groups or Pooled Sample

bootstrap · hypothesis-testing · permutation-test · r · small-sample

Suppose I have data with two independent groups:

g1.lengths <- c(112.64, 97.10, 84.18, 106.96, 98.42, 101.66)

g2.lengths <- c(84.44, 82.10, 83.26, 81.02, 81.86, 86.80,
                85.84, 97.08, 79.64, 83.32, 91.04, 85.92,
                73.52, 85.58, 97.70, 89.72, 88.92, 103.72,
                105.02, 99.48, 89.50, 81.74)

group <- rep(c("g1", "g2"), c(length(g1.lengths), length(g2.lengths)))

lengths <- data.frame(lengths = c(g1.lengths, g2.lengths), group = group)

The sample sizes are clearly unbalanced: g1 has 6 observations while g2 has 22. A traditional ANOVA suggests that the group means differ at the 0.05 significance level (p-value 0.0044).

summary(aov(lengths ~ group, data = lengths))

Given that my aim is to compare the difference in means, such unbalanced, small-sample data might give unreliable results with the traditional approach. Therefore, I want to perform a permutation test and a bootstrap.
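As a quick parametric cross-check, one could also run Welch's t-test, the default of R's t.test, which does not assume equal variances; a side sketch, not needed for what follows:

# Welch's t-test: does not assume equal variances in the two groups
t.test(lengths ~ group, data = lengths)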

PERMUTATION TEST

The null hypothesis (H0) states that the group means are the same. In a permutation test this assumption is enforced by pooling the groups into one sample, which ensures that the samples for the two groups were drawn from an identical distribution. By repeatedly sampling (or more precisely, reshuffling) from the pooled data, the observations are reallocated (shuffled) to the two samples in a new way, and the test statistic is calculated each time. Performing this n times gives the sampling distribution of the test statistic under the assumption that H0 is TRUE. Finally, under H0, the p-value is the probability that the test statistic equals or exceeds the observed value.

s.size.g1 <- length(g1.lengths)
s.size.g2 <- length(g2.lengths)

pool <- lengths$lengths
obs.diff.p <- mean(g1.lengths) - mean(g2.lengths)
iterations <- 10000
sampl.dist.p <- numeric(iterations)

set.seed(5)
for (i in 1:iterations) {
        # reshuffle the pooled observations (sampling without replacement)
        resample <- sample(1:length(pool), length(pool))

        g1.perm <- pool[resample][1:s.size.g1]
        g2.perm <- pool[resample][(s.size.g1 + 1):length(pool)]
        sampl.dist.p[i] <- mean(g1.perm) - mean(g2.perm)
}
p.permute <- (sum(abs(sampl.dist.p) >= abs(obs.diff.p)) + 1) / (iterations + 1)

The reported p-value of the permutation test is 0.0053. OK, if I did it correctly, the permutation test and the parametric ANOVA give nearly identical results.
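The same loop can also be written more compactly with replicate (a sketch reusing the objects defined above; it should give an essentially identical p-value):

# compact equivalent of the permutation loop above
set.seed(5)
sampl.dist.p2 <- replicate(iterations, {
        resample <- sample(pool)   # permute the pooled values directly
        mean(resample[1:s.size.g1]) - mean(resample[-(1:s.size.g1)])
})
p.permute2 <- (sum(abs(sampl.dist.p2) >= abs(obs.diff.p)) + 1) / (iterations + 1)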

BOOTSTRAP

First of all, I am aware that the bootstrap cannot help when sample sizes are too small. This post showed that it can be even worse and misleading. A second one highlighted that the permutation test is generally better than the bootstrap when hypothesis testing is the main aim. Nonetheless, this great post addresses important differences among computer-intensive methods. However, here I want to raise what I believe is a different question.

Let me introduce the most common bootstrap approach first (Bootstrap1: resampling within the pooled sample):

s.size.g1 <- length(g1.lengths)
s.size.g2 <- length(g2.lengths)

pool <- lengths$lengths
obs.diff.b1 <- mean(g1.lengths) - mean(g2.lengths)
iterations <- 10000
sampl.dist.b1 <- numeric(iterations)

set.seed(5)
for (i in 1:iterations) {
        resample <- sample(1:length(pool), length(pool), replace = TRUE)
        # "replace = TRUE" is the only difference between the bootstrap and the permutation test

        g1.perm <- pool[resample][1:s.size.g1]
        g2.perm <- pool[resample][(s.size.g1 + 1):length(pool)]
        sampl.dist.b1[i] <- mean(g1.perm) - mean(g2.perm)
}
p.boot1 <- (sum(abs(sampl.dist.b1) >= abs(obs.diff.b1)) + 1) / (iterations + 1)

The p-value of the bootstrap performed this way is 0.005. Even though this sounds reasonable and is almost identical to the parametric ANOVA and the permutation test, is it appropriate to claim that H0 holds in this bootstrap merely because we pooled the samples before drawing the resamples?

I found a different approach in several scientific papers. Specifically, I saw that researchers modify the data to meet H0 prior to bootstrapping. Searching around, I found a very interesting post on CV where @jan.s explained unusual bootstrap results in a question whose aim was to compare two means. However, that post does not cover how to perform the bootstrap when the data are modified beforehand. The approach where the data are modified prior to the bootstrap looks like this:

  1. H0 states that the means of the two groups are the same
  2. H0 is made true by subtracting the mean of the pooled sample from the individual observations

In this case, the modification of the data should affect the group means, and therefore their difference, but not the variation within (and between) groups.

  3. The modified data form the basis for the bootstrap, with the caveat that sampling is carried out within each group separately.
  4. The difference between the bootstrapped means of g1 and g2 is calculated and compared with the observed (non-modified) difference between the groups.
  5. The proportion of values equal to or more extreme than the observed one, divided by the number of iterations, gives the p-value.

Here is the code (Bootstrap2: resampling within the groups after modifying the data so that H0 is TRUE):

s.size.g1 <- length(g1.lengths)
s.size.g2 <- length(g2.lengths)

pool <- lengths$lengths
obs.diff.b2 <- mean(g1.lengths) - mean(g2.lengths)

# make H0 true (no difference between the means of the two groups)
H0 <- pool - mean(pool)

# g1 from H0
g1.H0 <- H0[1:s.size.g1]

# g2 from H0
g2.H0 <- H0[(s.size.g1 + 1):length(pool)]

iterations <- 10000
sampl.dist.b2 <- numeric(iterations)

set.seed(5)
for (i in 1:iterations) {
        # sample with replacement within g1
        g1.boot <- sample(g1.H0, replace = TRUE)

        # sample with replacement within g2
        g2.boot <- sample(g2.H0, replace = TRUE)

        # bootstrapped difference of means
        sampl.dist.b2[i] <- mean(g1.boot) - mean(g2.boot)
}
p.boot2 <- (sum(abs(sampl.dist.b2) >= abs(obs.diff.b2)) + 1) / (iterations + 1)

The bootstrap performed this way gives a p-value of 0.514, which is dramatically different from the previous tests. I believe this relates to @jan.s's explanation, but I cannot figure out where the key is…

Best Answer

Here is my take on it, based on chapter 16 of Efron and Tibshirani's An Introduction to the Bootstrap (pages 220-224). The short of it is that your second bootstrap algorithm was implemented incorrectly, but the general idea is correct.

When conducting bootstrap tests, one has to make sure that the resampling method generates data that correspond to the null hypothesis. I will use the sleep data in R to illustrate this. Note that I am using the studentized test statistic rather than just the difference of means, as the textbook recommends.
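For reference, the studentized statistic that t.test computes below is the Welch form, where $\bar{x}, \bar{y}$ are the group means, $s_x^2, s_y^2$ the sample variances and $n_1, n_2$ the group sizes:

$$ t = \frac{\bar{x} - \bar{y}}{\sqrt{s_x^2/n_1 + s_y^2/n_2}} $$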

The classical t-test, which uses an analytical result to obtain information about the sampling distribution of the t-statistic, yields the following result:

x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]
t.test(x, y)
t = -1.8608, df = 17.776, p-value = 0.07939

One approach is similar in spirit to the better-known permutation test: samples are taken across the entire set of observations while ignoring the grouping labels. Then the first $n_1$ observations are assigned to the first group and the remaining $n_2$ to the second group.

# pooled sample, assumes equal variance
pooled <- c(x, y)
boot.t <- numeric(10000)   # pre-allocate the vector of bootstrap statistics
for (i in 1:10000){
  sample.index <- sample(c(1:length(pooled)), replace = TRUE)
  sample.x <- pooled[sample.index][1:length(x)]
  sample.y <- pooled[sample.index][-c(1:length(x))]   # the remaining n2 observations
  boot.t[i] <- t.test(sample.x, sample.y)$statistic
}
p.pooled <- (1 + sum(abs(boot.t) >= abs(t.test(x, y)$statistic))) / (10000 + 1)
p.pooled
[1] 0.07929207

However, this algorithm actually tests whether the distributions of x and y are identical. If we are simply interested in whether or not their population means are equal, without making any assumptions about their variances, we should generate data under $H_0$ in a slightly different manner. You were on the right track with your approach, but your translation to $H_0$ is a bit different from the one proposed in the textbook. To generate data under $H_0$, we need to subtract the first group's mean from the observations in the first group and then add the common, or pooled, mean $\bar{z}$. For the second group we do the same thing.

$$ \tilde{x}_i = x_i - \bar{x} + \bar{z} $$
$$ \tilde{y}_i = y_i - \bar{y} + \bar{z} $$

This becomes more intuitive when you calculate the means of the new variables $\tilde{x}$ and $\tilde{y}$. By first subtracting their respective group means, the variables become centred around zero. By adding the overall mean $\bar{z}$, we end up with observations centred around the overall mean. In other words, we have transformed the observations so that both groups have the same mean, namely the overall mean of both groups together, which is exactly $H_0$.

# sample from H0 separately, no assumption about equal variance
xt <- x - mean(x) + mean(sleep$extra)
yt <- y - mean(y) + mean(sleep$extra)
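# quick check (an aside): both transformed samples now share the overall
# mean, so the generated data satisfy H0
c(mean(xt), mean(yt), mean(sleep$extra))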

boot.t <- numeric(10000)
for (i in 1:10000){
  sample.x <- sample(xt, replace = TRUE)
  sample.y <- sample(yt, replace = TRUE)
  boot.t[i] <- t.test(sample.x, sample.y)$statistic
}
p.h0 <-  (1 + sum(abs(boot.t) >= abs(t.test(x,y)$statistic))) / (10000+1) 
p.h0
[1] 0.08049195

This time around we ended up with similar p-values for the three approaches.
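Translated back to the original g1/g2 data from the question, the corrected Bootstrap2 would look roughly like this sketch (it reuses pool, iterations and obs.diff.b2 from the question; the names g1.tilde and g2.tilde are mine), and one would then expect a p-value in the same neighbourhood as the permutation test rather than 0.514:

# centre each group on its own mean, then shift both to the pooled mean:
# this removes the difference in means but preserves within-group variation
g1.tilde <- g1.lengths - mean(g1.lengths) + mean(pool)
g2.tilde <- g2.lengths - mean(g2.lengths) + mean(pool)

set.seed(5)
sampl.dist.b3 <- replicate(iterations,
        mean(sample(g1.tilde, replace = TRUE)) - mean(sample(g2.tilde, replace = TRUE)))
p.boot3 <- (sum(abs(sampl.dist.b3) >= abs(obs.diff.b2)) + 1) / (iterations + 1)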
