Hypothesis Testing – Why Shift the Mean of a Bootstrap Distribution?

bootstrap, hypothesis-testing

I'm wondering about why we do a particular thing (and not another) when conducting bootstrapped hypothesis tests.

My understanding of bootstrapped hypothesis tests is this (based on this helpful explanation):

  1. State a null and alternative hypothesis. H0: mean = 50 and Ha: mean ≠ 50, for example.
  2. Collect a sample of data. Let's say the mean of this sample is 62.
  3. From each of the observations in our sample, subtract the mean of our sample. So in this example, subtract 62 from each observation. Then add the mean value under the null hypothesis to each of our observations. So in this example, add 50 to each of our observations. Now we've got a sample that's centered on our null hypothesis mean.
  4. Resample this shifted sample with replacement a bunch of times. Each time, recalculate the mean, so we have a bunch of bootstrapped means. This is our bootstrapped distribution under the null hypothesis.
  5. Compare our original observed sample mean (62) to our bootstrapped null distribution. The proportion of bootstrapped means at least as extreme as 62 gives us a p-value. (A code sketch of this procedure follows this list.)
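To make the steps concrete, here's a minimal sketch of steps 3-5 in Python/NumPy (the function name, seed, and 10,000 resamples are just my illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)  # fixed seed, purely for reproducibility

    def bootstrap_test_shifted(sample, null_mean, n_boot=10_000):
        """Two-sided bootstrap test of H0: mean == null_mean."""
        sample = np.asarray(sample, dtype=float)
        observed = sample.mean()
        # Step 3: recenter the sample on the hypothesized mean.
        shifted = sample - observed + null_mean
        # Step 4: bootstrap distribution of the mean under the null.
        boot_means = np.array([
            rng.choice(shifted, size=len(shifted), replace=True).mean()
            for _ in range(n_boot)
        ])
        # Step 5: fraction of null bootstrap means at least as far from
        # the null mean as the observed mean is (two-sided).
        return np.mean(np.abs(boot_means - null_mean)
                       >= abs(observed - null_mean))

So bootstrap_test_shifted(data, 50) would give the two-sided p-value in the example above.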

Ok here's my question:

Is it ok to do what I'm about to describe below? If not, why not? And if not, is there a version of this that's ok to do?

  1. State a null and alternative hypothesis. H0: mean = 50 and Ha: mean ≠ 50, for example. Same as before.
  2. Collect a sample of data. Let's say the mean of this sample is 62. Same as before.
  3. Resample our original sample with replacement a bunch of times. Each time, recalculate the mean, so we have a bunch of bootstrapped means. This creates a bootstrapped distribution (but not one centered on the null hypothesis value). This is different from above.
  4. Compare our null hypothesis mean (50) to our bootstrapped distribution (centered roughly on 62). See how many bootstrapped means are at least as extreme as 50 is. This gives us a p-value. This is different from above. (A sketch of this variant follows this list.)
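And here, in the same style, is a sketch of the variant I'm asking about (again just my illustrative names; only the shift and the comparison change):

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_test_unshifted(sample, null_mean, n_boot=10_000):
        """The proposed variant: resample the ORIGINAL sample and ask
        how extreme the null mean looks relative to that distribution."""
        sample = np.asarray(sample, dtype=float)
        observed = sample.mean()
        # Step 3: bootstrap distribution centered (roughly) on the
        # observed mean, not on the null value.
        boot_means = np.array([
            rng.choice(sample, size=len(sample), replace=True).mean()
            for _ in range(n_boot)
        ])
        # Step 4: fraction of bootstrap means at least as far from the
        # observed mean as the null mean is (two-sided).
        return np.mean(np.abs(boot_means - observed)
                       >= abs(null_mean - observed))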

Again: Is it ok to do that? If not, why not? And if not, is there a version of this that's ok to do?

Any help would be much appreciated. Happy to clarify anything. Thanks so much!

PS: I've read this related post, but my question is why it's not acceptable to use the empirical distribution as the reference distribution when calculating the p-value.

Best Answer

The idea is to emulate the sampling distribution under the null hypothesis (from which you get an approximate p-value).

So you make a sample that's shaped like the one you have but with a mean equal to the hypothesized one, and see how unusual your observed sample is relative to that (this is suitable for a shift alternative). This emulates, as nearly as we can, how the sampling distribution would behave under the null.

By doing it the other way around, as you're suggesting, you end up answering quite a different question from the one you're testing.

Imagine, for example, that we have a sample from a right-skewed distribution.

Let us further suppose that we are interested in testing whether the population mean is 100 against the one-tailed alternative that it's greater than 100, and we get the following observations:

 99.84 100.47  99.97 100.62 101.48 100.28 100.18 100.09  99.99  99.73

The sample mean is 100.265.
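If you want to check this numerically, here's a short sketch (Python/NumPy; the seed and the 10,000 resamples are arbitrary choices) computing the one-tailed p-value under each scheme for these observations:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.array([99.84, 100.47, 99.97, 100.62, 101.48,
                  100.28, 100.18, 100.09, 99.99, 99.73])
    mu0 = 100.0        # hypothesized mean
    obs = x.mean()     # 100.265
    n_boot = 10_000

    def boot_means(data):
        # bootstrap distribution of the sample mean
        return np.array([
            rng.choice(data, size=len(data), replace=True).mean()
            for _ in range(n_boot)
        ])

    # Correct scheme: resample the shifted sample, look in the UPPER tail
    # (how often does a null bootstrap mean reach the observed 100.265?).
    p_correct = np.mean(boot_means(x - obs + mu0) >= obs)

    # Proposed scheme: resample the original sample, look in the LOWER tail
    # (how often does a bootstrap mean fall at or below the null value 100?).
    p_proposed = np.mean(boot_means(x) <= mu0)

    print(p_correct, p_proposed)

With this right-skewed sample, p_correct comes out larger than p_proposed, for the reasons below.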

Here are the two comparisons you'd be making under the two resampling schemes:

[Figure: histograms of the bootstrap means under the two schemes. The right tail is heavier in both, but under scheme 2 the p-value is obtained by looking in the left tail.]

Under the correct scheme (keep the shape of the sampling distribution but center it at the hypothesized mean, then ask how unusual the observed mean is), we look at an upper-tail area; that upper tail is heavy, so the p-value is relatively large.

Under your proposed scheme we have to look in the left tail, which is short, so the hypothesized value looks inconsistent. The p-value comes out smaller because we end up measuring the area in the short tail, when it was the long right tail under the null that "produced" the high sample mean.

When the sample mean is above the hypothesized mean, it should tend to be far above, and when it's below, it should tend to be close by. Your scheme switches those two around, and so miscalculates the tail area.

In the case of a simple two-sided test, where we look in both tails, which scheme we use should make no difference; but the discrepancy in a case as simple as the one above makes it clear that the two won't correspond in general. We should do it the right way around rather than relying on it working out in particular cases.