Hypothesis Testing – Why Shift the Mean of a Bootstrap Distribution?

bootstrap, hypothesis-testing

I'm wondering about why we do a particular thing (and not another) when conducting bootstrapped hypothesis tests.

My understanding of bootstrapped hypothesis tests is this (based on this helpful explanation):

  1. State a null and alternative hypothesis. H0: mean = 50 and Ha: mean ≠ 50, for example.
  2. Collect a sample of data. Let's say the mean of this sample is 62.
  3. From each of the observations in our sample, subtract the mean of our sample. So in this example, subtract 62 from each observation. Then add the mean value under the null hypothesis to each of our observations. So in this example, add 50 to each of our observations. Now we've got a sample that's centered on our null hypothesis mean.
  4. Resample this shifted sample with replacement a bunch of times. Each time, recalculate the mean, so we have a bunch of bootstrapped means. This is our bootstrapped distribution under the null hypothesis.
  5. Compare our original observed sample mean (62) to our bootstrapped null distribution. The proportion of bootstrapped means at least as extreme as 62 gives us a p-value. (A code sketch of this procedure follows this list.)
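To make the steps concrete, here's a minimal sketch of steps 3-5 in Python/NumPy (the function name, seed, and 10,000 resamples are just my illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)  # fixed seed, purely for reproducibility

    def bootstrap_test_shifted(sample, null_mean, n_boot=10_000):
        """Two-sided bootstrap test of H0: mean == null_mean."""
        sample = np.asarray(sample, dtype=float)
        observed = sample.mean()
        # Step 3: recenter the sample on the hypothesized mean.
        shifted = sample - observed + null_mean
        # Step 4: bootstrap distribution of the mean under the null.
        boot_means = np.array([
            rng.choice(shifted, size=len(shifted), replace=True).mean()
            for _ in range(n_boot)
        ])
        # Step 5: fraction of null bootstrap means at least as far from
        # the null mean as the observed mean is (two-sided).
        return np.mean(np.abs(boot_means - null_mean)
                       >= abs(observed - null_mean))

So bootstrap_test_shifted(data, 50) would give the two-sided p-value in the example above.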

Ok here's my question:

Is it ok to do what I'm about to describe below? If not, why not? And if not, is there a version of this that's ok to do?

  1. State a null and alternative hypothesis. H0: mean = 50 and Ha: mean ≠ 50, for example. Same as before.
  2. Collect a sample of data. Let's say the mean of this sample is 62. Same as before.
  3. Resample our original sample with replacement a bunch of times. Each time, recalculate the mean, so we have a bunch of bootstrapped means. This creates a bootstrapped distribution (but not one centered on the null hypothesis value). This is different from above.
  4. Compare our null hypothesis mean (50) to our bootstrapped distribution (centered roughly on 62). See how many bootstrapped means are at least as extreme as 50 is. This gives us a p-value. This is different from above. (A sketch of this variant follows this list.)
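And here, in the same style, is a sketch of the variant I'm asking about (again just my illustrative names; only the shift and the comparison change):

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_test_unshifted(sample, null_mean, n_boot=10_000):
        """The proposed variant: resample the ORIGINAL sample and ask
        how extreme the null mean looks relative to that distribution."""
        sample = np.asarray(sample, dtype=float)
        observed = sample.mean()
        # Step 3: bootstrap distribution centered (roughly) on the
        # observed mean, not on the null value.
        boot_means = np.array([
            rng.choice(sample, size=len(sample), replace=True).mean()
            for _ in range(n_boot)
        ])
        # Step 4: fraction of bootstrap means at least as far from the
        # observed mean as the null mean is (two-sided).
        return np.mean(np.abs(boot_means - observed)
                       >= abs(null_mean - observed))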

Again: Is it ok to do that? If not, why not? And if not, is there a version of this that's ok to do?

Any help would be much appreciated. Happy to clarify anything. Thanks so much!

PS: I've read this related post, but my question is why it's not acceptable to use the empirical distribution as the reference distribution when calculating the p-value.

Best Answer

The idea is to emulate the sampling distribution under the null hypothesis (from which you get an approximate p-value).

So you make a sample that's shaped like the one you have but with a mean equal to the hypothesized one, and see how unusual your observed sample is relative to that (this is suitable for a shift alternative). This emulates, as nearly as we can, how the sampling distribution would behave under the null.

By doing it the other way around, as you're suggesting, you end up answering quite a different question from the one you're testing.

Imagine, for example, that we have a sample from a right-skewed distribution.

Let us further suppose that we are interested in testing whether the population mean is 100 against the one-tailed alternative that it's greater than 100, and we get the following observations:

 99.84 100.47  99.97 100.62 101.48 100.28 100.18 100.09  99.99  99.73

The sample mean is 100.265.
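If you want to check this numerically, here's a short sketch (Python/NumPy; the seed and the 10,000 resamples are arbitrary choices) computing the one-tailed p-value under each scheme for these observations:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.array([99.84, 100.47, 99.97, 100.62, 101.48,
                  100.28, 100.18, 100.09, 99.99, 99.73])
    mu0 = 100.0        # hypothesized mean
    obs = x.mean()     # 100.265
    n_boot = 10_000

    def boot_means(data):
        # bootstrap distribution of the sample mean
        return np.array([
            rng.choice(data, size=len(data), replace=True).mean()
            for _ in range(n_boot)
        ])

    # Correct scheme: resample the shifted sample, look in the UPPER tail
    # (how often does a null bootstrap mean reach the observed 100.265?).
    p_correct = np.mean(boot_means(x - obs + mu0) >= obs)

    # Proposed scheme: resample the original sample, look in the LOWER tail
    # (how often does a bootstrap mean fall at or below the null value 100?).
    p_proposed = np.mean(boot_means(x) <= mu0)

    print(p_correct, p_proposed)

With this right-skewed sample, p_correct comes out larger than p_proposed, for the reasons below.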

Here are the two comparisons you'd be making under the two resampling schemes:

[Figure: histograms of the bootstrap means under the two schemes. The right tail is heavier in both, but under scheme 2 the p-value is obtained by looking in the left tail.]

Under the correct scheme (keep the shape of the sampling distribution but center it at the hypothesized mean, then ask how unusual the observed mean is), we look at an upper-tail area; that upper tail is heavy, so the p-value is relatively large.

Under your proposed scheme we have to look in the left tail, which is short, so the hypothesized value looks inconsistent. The p-value comes out smaller because we end up measuring the area in the short tail, when it was the long right tail under the null that "produced" the high sample mean.

When the sample mean is above the hypothesized mean, it should tend to be far above, and when it's below, it should tend to be close by. Your scheme switches those two around, and so miscalculates the tail area.

In the case of a simple two-sided test, where we look in both tails, which scheme we use should make no difference; but the discrepancy in a case as simple as the one above makes it clear that the two won't correspond in general. We should do it the right way around rather than relying on it working out in particular cases.