Solved – Two methods of bootstrap significance tests

bootstrap, p-value, statistical-significance

Using the bootstrap, I calculate p-values for significance tests in two ways:

  1. resampling under the null hypothesis and counting the outcomes at least as extreme as the outcome coming from the original data
  2. resampling under the alternative hypothesis and counting the outcomes at least as distant from the original outcome as the value corresponding to the null hypothesis

I believe that the 1st approach is entirely correct, as it follows the definition of a p-value. I'm less sure about the second, but it usually gives very similar results and reminds me of a Wald test.

Am I right? Are both methods correct? Are they identical (for large samples)?


Examples for the two methods (edits after DWin's questions and Erik's answer):

Example 1. Let's construct a bootstrap test similar to the two sample T test.
Method 1 will resample from one sample (obtained by pooling the original two).
Method 2 will resample from both samples independently.
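A minimal numpy sketch of the two resampling schemes for Example 1 (the data, sample sizes, and number of resamples below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 30)   # hypothetical sample 1
y = rng.normal(0.5, 1.0, 30)   # hypothetical sample 2
obs = x.mean() - y.mean()      # observed difference in means
B = 10_000

# Method 1: resample under H0 from the pooled sample, then count
# resampled differences at least as extreme as the observed one
pooled = np.concatenate([x, y])
stat1 = np.array([
    rng.choice(pooled, x.size).mean() - rng.choice(pooled, y.size).mean()
    for _ in range(B)
])
p1 = np.mean(np.abs(stat1) >= np.abs(obs))

# Method 2: resample each sample separately (no H0 imposed), then count
# resampled differences at least as far from the observed value as the
# null value (0) is from the observed value
stat2 = np.array([
    rng.choice(x, x.size).mean() - rng.choice(y, y.size).mean()
    for _ in range(B)
])
p2 = np.mean(np.abs(stat2 - obs) >= np.abs(0.0 - obs))
```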

Example 2. Let's construct a bootstrap test of correlation between x₁…xₙ and y₁…yₙ. Method 1 will assume no correlation and resample allowing for (xᵢ, yⱼ) pairs where i≠j.
Method 2 will compile a bootstrap sample of the original (x,y) pairs.
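Example 2 can be sketched the same way; here the pairing is either broken (Method 1, which enforces H0) or kept (Method 2). Again, the data below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)   # hypothetical correlated data
obs_r = np.corrcoef(x, y)[0, 1]
B = 5_000

# Method 1: resample x and y independently, which breaks the pairing
# and thus imposes H0 (no correlation)
r1 = np.array([
    np.corrcoef(rng.choice(x, n), rng.choice(y, n))[0, 1]
    for _ in range(B)
])
p1 = np.mean(np.abs(r1) >= np.abs(obs_r))

# Method 2: resample (x, y) pairs jointly, then compare how far the
# bootstrap correlations fall from obs_r with the distance of the
# null value (r = 0) from obs_r
idx = rng.integers(0, n, size=(B, n))
r2 = np.array([np.corrcoef(x[i], y[i])[0, 1] for i in idx])
p2 = np.mean(np.abs(r2 - obs_r) >= np.abs(0.0 - obs_r))
```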

Example 3. Let's construct a bootstrap test to check if a coin is fair.
Method 1 will create random samples setting Pr(head)=Pr(tail)=½.
Method 2 will resample the sample of experimental head/tail values and compare the proportions to ½.
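For Example 3, both methods reduce to binomial draws: simulating fair flips under H0 (Method 1), or resampling the observed 0/1 flips with replacement, which is equivalent to binomial draws with the observed proportion (Method 2). A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
flips = rng.random(100) < 0.6      # hypothetical observed flips, True = head
obs = flips.mean()                  # observed proportion of heads
B = 10_000

# Method 1: simulate under H0 with Pr(head) = 0.5, count outcomes
# at least as far from 0.5 as the observed proportion
sim = rng.binomial(flips.size, 0.5, B) / flips.size
p1 = np.mean(np.abs(sim - 0.5) >= np.abs(obs - 0.5))

# Method 2: resample the observed flips (binomial with p = obs is
# equivalent), count outcomes at least as far from obs as 0.5 is
boot = rng.binomial(flips.size, obs, B) / flips.size
p2 = np.mean(np.abs(boot - obs) >= np.abs(0.5 - obs))
```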

Best Answer

The first approach is classical and trustworthy, but it cannot always be used. To get bootstrap samples under the null hypothesis, you must either be willing to assume that a theoretical distribution holds (this is your first option) or assume that your statistic of interest has the same distributional shape when shifted to the null hypothesis (your second option). For example, under the usual assumptions the t-distribution has the same shape when shifted to another mean. However, changing the null frequency of a binomial distribution from 0.5 to 0.025 will also change its shape.

In my experience, when you are willing to make these assumptions you often also have other options. In your Example 1, where you seem to assume that both samples could have come from the same base population, a permutation test would in my opinion be better.
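A permutation test for the two-sample setting reshuffles the group labels without replacement, rather than resampling with replacement. A short sketch (data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 25)   # hypothetical sample 1
y = rng.normal(0.7, 1.0, 25)   # hypothetical sample 2
obs = x.mean() - y.mean()
pooled = np.concatenate([x, y])
B = 10_000

# Shuffle the pooled data and reassign the first x.size values to
# group 1 and the rest to group 2; this permutes the labels under H0
perm = np.empty(B)
for b in range(B):
    rng.shuffle(pooled)
    perm[b] = pooled[:x.size].mean() - pooled[x.size:].mean()

# +1 correction keeps the p-value away from exactly 0
p = (1 + np.sum(np.abs(perm) >= np.abs(obs))) / (1 + B)
```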

There is another option (which seems to be your 2nd choice) based on bootstrap confidence intervals. Basically, it assumes that if the stated coverage holds, then significance at level $\alpha$ is equivalent to the null hypothesis not being contained in the $(1-\alpha)$-confidence interval. See for example this question: What is the difference between confidence intervals and hypothesis testing?

This is a very flexible method, applicable to many tests. However, it is critical to construct good bootstrap confidence intervals, and not simply use Wald approximations or the percentile method. Some info is here: Bootstrap-based confidence interval
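One way to follow this advice in practice is `scipy.stats.bootstrap` with the BCa method (a better-calibrated interval than the plain percentile method). A hedged sketch for testing H0: mean = 0, with invented data:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(4)
x = rng.normal(0.3, 1.0, 40)   # hypothetical sample; H0: mean = 0

# BCa confidence interval for the mean; np.mean is vectorized over
# resamples via its axis argument
res = bootstrap((x,), np.mean, confidence_level=0.95,
                method='BCa', random_state=rng)
lo, hi = res.confidence_interval

# Reject H0 at alpha = 0.05 iff the null value lies outside the CI
reject = not (lo <= 0.0 <= hi)
```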