Solved – Two methods of bootstrap significance tests

bootstrap, p-value, statistical-significance

Using the bootstrap, I calculate p-values for significance tests in two ways:

  1. resampling under the null hypothesis and counting the outcomes at least as extreme as the outcome coming from the original data
  2. resampling under the alternative hypothesis and counting the outcomes at least as distant from the original outcome as the value corresponding to the null hypothesis

I believe that the 1st approach is entirely correct, as it follows the definition of a p-value. I'm less sure about the second, but it usually gives very similar results and reminds me of a Wald test.

Am I right? Are both methods correct? Are they identical (for large samples)?


Examples for the two methods (edits after DWin's questions and Erik's answer):

Example 1. Let's construct a bootstrap test similar to the two sample T test.
Method 1 will resample from one sample (obtained by pooling the original two).
Method 2 will resample from both samples independently.
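A minimal numpy sketch of the two resampling schemes for Example 1 (the data, sample sizes, and number of resamples below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 30)   # hypothetical sample 1
y = rng.normal(0.5, 1.0, 30)   # hypothetical sample 2
obs = x.mean() - y.mean()      # observed difference in means
B = 10_000

# Method 1: resample under H0 from the pooled sample, then count
# resampled differences at least as extreme as the observed one
pooled = np.concatenate([x, y])
stat1 = np.array([
    rng.choice(pooled, x.size).mean() - rng.choice(pooled, y.size).mean()
    for _ in range(B)
])
p1 = np.mean(np.abs(stat1) >= np.abs(obs))

# Method 2: resample each sample separately (no H0 imposed), then count
# resampled differences at least as far from the observed value as the
# null value (0) is from the observed value
stat2 = np.array([
    rng.choice(x, x.size).mean() - rng.choice(y, y.size).mean()
    for _ in range(B)
])
p2 = np.mean(np.abs(stat2 - obs) >= np.abs(0.0 - obs))
```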

Example 2. Let's construct a bootstrap test of correlation between x₁…xₙ and y₁…yₙ. Method 1 will assume no correlation and resample allowing for (xᵢ, yⱼ) pairs where i≠j.
Method 2 will compile a bootstrap sample of the original (x,y) pairs.
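Example 2 can be sketched the same way; here the pairing is either broken (Method 1, which enforces H0) or kept (Method 2). Again, the data below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)   # hypothetical correlated data
obs_r = np.corrcoef(x, y)[0, 1]
B = 5_000

# Method 1: resample x and y independently, which breaks the pairing
# and thus imposes H0 (no correlation)
r1 = np.array([
    np.corrcoef(rng.choice(x, n), rng.choice(y, n))[0, 1]
    for _ in range(B)
])
p1 = np.mean(np.abs(r1) >= np.abs(obs_r))

# Method 2: resample (x, y) pairs jointly, then compare how far the
# bootstrap correlations fall from obs_r with the distance of the
# null value (r = 0) from obs_r
idx = rng.integers(0, n, size=(B, n))
r2 = np.array([np.corrcoef(x[i], y[i])[0, 1] for i in idx])
p2 = np.mean(np.abs(r2 - obs_r) >= np.abs(0.0 - obs_r))
```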

Example 3. Let's construct a bootstrap test to check if a coin is fair.
Method 1 will create random samples setting Pr(head)=Pr(tail)=½.
Method 2 will resample the sample of experimental head/tail values and compare the proportions to ½.
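For Example 3, both methods reduce to binomial draws: simulating fair flips under H0 (Method 1), or resampling the observed 0/1 flips with replacement, which is equivalent to binomial draws with the observed proportion (Method 2). A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
flips = rng.random(100) < 0.6      # hypothetical observed flips, True = head
obs = flips.mean()                  # observed proportion of heads
B = 10_000

# Method 1: simulate under H0 with Pr(head) = 0.5, count outcomes
# at least as far from 0.5 as the observed proportion
sim = rng.binomial(flips.size, 0.5, B) / flips.size
p1 = np.mean(np.abs(sim - 0.5) >= np.abs(obs - 0.5))

# Method 2: resample the observed flips (binomial with p = obs is
# equivalent), count outcomes at least as far from obs as 0.5 is
boot = rng.binomial(flips.size, obs, B) / flips.size
p2 = np.mean(np.abs(boot - obs) >= np.abs(0.5 - obs))
```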

Best Answer

The first approach is classical and trustworthy, but it cannot always be used. To get bootstrap samples under the null hypothesis, you must either be willing to assume that a theoretical distribution holds (this is your first option) or assume that your statistic of interest has the same distributional shape when shifted to the null hypothesis (your second option). For example, under the usual assumptions the t-distribution has the same shape when shifted to another mean. However, changing the null frequency of a binomial distribution from 0.5 to 0.025 will also change its shape.

In my experience, when you are willing to make these assumptions you often also have other options. In your Example 1, where you seem to assume that both samples could have come from the same base population, a permutation test would in my opinion be better.
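A permutation test for the two-sample setting reshuffles the group labels without replacement, rather than resampling with replacement. A short sketch (data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 25)   # hypothetical sample 1
y = rng.normal(0.7, 1.0, 25)   # hypothetical sample 2
obs = x.mean() - y.mean()
pooled = np.concatenate([x, y])
B = 10_000

# Shuffle the pooled data and reassign the first x.size values to
# group 1 and the rest to group 2; this permutes the labels under H0
perm = np.empty(B)
for b in range(B):
    rng.shuffle(pooled)
    perm[b] = pooled[:x.size].mean() - pooled[x.size:].mean()

# +1 correction keeps the p-value away from exactly 0
p = (1 + np.sum(np.abs(perm) >= np.abs(obs))) / (1 + B)
```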

There is another option (which seems to be your 2nd choice) based on bootstrap confidence intervals. Basically, it assumes that if the stated coverage holds, then significance at level $\alpha$ is equivalent to the null hypothesis not being contained in the $(1-\alpha)$-confidence interval. See for example this question: What is the difference between confidence intervals and hypothesis testing?

This is a very flexible method, applicable to many tests. However, it is critical to construct good bootstrap confidence intervals, and not simply use Wald approximations or the percentile method. Some info is here: Bootstrap-based confidence interval
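One way to follow this advice in practice is `scipy.stats.bootstrap` with the BCa method (a better-calibrated interval than the plain percentile method). A hedged sketch for testing H0: mean = 0, with invented data:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(4)
x = rng.normal(0.3, 1.0, 40)   # hypothetical sample; H0: mean = 0

# BCa confidence interval for the mean; np.mean is vectorized over
# resamples via its axis argument
res = bootstrap((x,), np.mean, confidence_level=0.95,
                method='BCa', random_state=rng)
lo, hi = res.confidence_interval

# Reject H0 at alpha = 0.05 iff the null value lies outside the CI
reject = not (lo <= 0.0 <= hi)
```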