Bootstrap – Why Resample Under Null Hypothesis in Hypothesis Testing


The straightforward way to apply bootstrap methods to hypothesis testing is to estimate a confidence interval for the test statistic $\hat{\theta}$ by repeatedly computing it on bootstrap samples (call the statistic computed from a bootstrap sample $\hat{\theta}^*$). We reject $H_0$ if the hypothesized parameter $\theta_0$ (which usually equals 0) lies outside the confidence interval of $\hat{\theta}^*$.
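
To make the procedure concrete, here is a minimal sketch of what I mean (assuming, purely for illustration, that $\theta$ is a population mean, $\theta_0 = 0$, and a 95% percentile interval is used):

```python
# Sketch of the naive bootstrap CI test (illustrative choices: theta is a
# population mean, theta_0 = 0, 95% percentile interval, B = 5000 replicates).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=50)      # observed sample
theta_hat = x.mean()                             # theta-hat = T(F_n)

B = 5000
theta_star = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])       # bootstrap replicates theta-hat*

lo, hi = np.percentile(theta_star, [2.5, 97.5])  # 95% percentile interval
theta_0 = 0.0
reject = not (lo <= theta_0 <= hi)               # reject H0 if theta_0 falls outside
print(f"CI = ({lo:.3f}, {hi:.3f}), reject H0: {reject}")
```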

I've read that this method lacks power. In their article "Two Guidelines for Bootstrap Hypothesis Testing" (1992), Hall P. and Wilson S.R. state as the first guideline that one should resample $\hat{\theta}^* - \hat{\theta}$, not $\hat{\theta}^* - \theta_0$. And this is the part I don't understand.

Doesn't $\hat{\theta}^* - \hat{\theta}$ measure just the bias of the estimator $\hat{\theta}^*$? For unbiased estimators the confidence intervals of this expression should always be narrower than those of $\hat{\theta}^* - \theta_0$, but I fail to see what this has to do with testing $\theta = \theta_0$: nowhere do we put in information about $\theta_0$.


For those of you who do not have access to this article, here is the relevant paragraph, which comes immediately after the stated guideline:

To appreciate why this is important, observe that the test will involve rejecting $H_0$ if $\left| \hat{\theta} - \theta_0\right|$ is "too large." If $\theta_0$ is a long way from the true value of $\theta$ (i.e., if $H_0$ is grossly in error), then the difference $\left|\hat{\theta} - \theta_0 \right|$ will never look very much too big compared to the nonparametric bootstrap distribution of $\left| \hat{\theta}^* - \theta_0\right|$. A more meaningful comparison is with the distribution of $\left| \hat{\theta}^* - \hat{\theta}\right|$. In fact, if the true value of $\theta$ is $\theta_1$, then the power of the bootstrap test increases to 1 as $\left|\theta_1 - \theta_0\right|$ increases, provided the test is based on resampling $\left| \hat{\theta}^* - \hat{\theta}\right|$, but the power decreases to at most the significance level (as $\left|\theta_1 - \theta_0\right|$ increases) if the test is based on resampling $\left|\hat{\theta}^* - \theta_0\right|$.
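
The quoted power claim can be checked with a quick simulation. This is only a sketch under assumed specifics (normal data, $\theta$ the mean, $\theta_0 = 0$, percentile-type critical values), not a reproduction of the paper's analysis:

```python
# Monte Carlo comparison of the two bootstrap tests of H0: mu = 0 as the
# true mean theta_1 moves away from theta_0 (assumed setup, for illustration).
import numpy as np

def reject_rates(theta_1, n=30, B=399, reps=200, alpha=0.05, seed=0):
    """Rejection rates of the recentred and naive bootstrap tests."""
    rng = np.random.default_rng(seed)
    rej_recentred = rej_naive = 0
    for _ in range(reps):
        x = rng.normal(theta_1, 1.0, n)          # sample with true mean theta_1
        theta_hat = x.mean()
        stars = np.array([rng.choice(x, size=n, replace=True).mean()
                          for _ in range(B)])
        # Critical values from the two candidate reference distributions:
        crit_recentred = np.quantile(np.abs(stars - theta_hat), 1 - alpha)
        crit_naive = np.quantile(np.abs(stars - 0.0), 1 - alpha)
        rej_recentred += abs(theta_hat - 0.0) > crit_recentred
        rej_naive += abs(theta_hat - 0.0) > crit_naive
    return rej_recentred / reps, rej_naive / reps

for theta_1 in (0.0, 0.5, 1.0, 2.0):
    print(theta_1, reject_rates(theta_1))
```

As $\theta_1$ grows, the rejection rate of the recentred test rises toward 1 while that of the naive test collapses, matching the quoted behaviour.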

Best Answer

This is the bootstrap analogy principle. The (unknown) underlying true distribution $F$ produced the sample at hand $x_1, \ldots, x_n$ with cdf $F_n$, which in turn produced the statistic $\hat\theta=T(F_n)$ for some functional $T(\cdot)$. The idea of the bootstrap is to make statements about the sampling distribution based on a known distribution $\tilde F$, to which you apply an identical sampling protocol (this is exactly possible only for i.i.d. data; dependent data always lead to limitations in how accurately one can reproduce the sampling process) and the same functional $T(\cdot)$. I demonstrated it in another post with (what I think is) a neat diagram.

So the bootstrap analogue of the (sampling + systematic) deviation $\hat\theta - \theta_0$, the quantity of your central interest, is the deviation of the bootstrap replicate $\hat\theta^*$ from what is known to be true for the distribution $\tilde F$, the sampling process you applied, and the functional $T(\cdot)$; that is, the relevant measure of central tendency is $T(\tilde F)$. If you used the standard nonparametric bootstrap with replacement from the original data, then $\tilde F=F_n$, so the measure of central tendency has to be $T(F_n) \equiv \hat \theta$ based on the original data.
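
As a minimal sketch of this principle (again assuming, for illustration only, that $\theta$ is a mean and $\theta_0 = 0$), the test compares the observed deviation $|\hat\theta - \theta_0|$ against the bootstrap distribution of $|\hat\theta^* - \hat\theta|$:

```python
# Sketch of the recentred bootstrap test: the reference distribution is
# |theta_hat* - theta_hat|, i.e. deviations from what is true under F_n.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.78, scale=1.0, size=40)   # observed sample
theta_hat, theta_0 = x.mean(), 0.0             # T(F_n) and the null value

B = 5000
theta_star = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])     # bootstrap replicates

# Compare the observed deviation from theta_0 with the bootstrap
# deviations from theta_hat (the value that is true under F_n).
p_value = np.mean(np.abs(theta_star - theta_hat) >= abs(theta_hat - theta_0))
print(f"recentred bootstrap p-value: {p_value:.4f}")
```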

Besides the translation, there are subtler issues with bootstrap tests that are sometimes difficult to overcome. The distribution of a test statistic under the null may be drastically different from its distribution under the alternative (e.g., in tests on the boundary of the parameter space, which fail with the bootstrap). The simple tests you learn in undergraduate classes, like the $t$-test, are invariant under shift, but the thought "Heck, I'll just shift everything" fails once you have to move to the next level of conceptual complexity, the asymptotic $\chi^2$ tests. Think about this: you are testing $\mu=0$, and your observed $\bar x=0.78$. When you construct a $\chi^2$ test $(\bar x-\mu)^2/(s^2/n) \equiv \bar x^2/(s^2/n)$ with the bootstrap analogue $\bar x^{*2}/(s^{*2}/n)$, this test has a built-in non-centrality of $n \bar x^2/s^2$ from the outset, instead of being the central test we would expect it to be. To make the bootstrap test central, you really have to subtract the original estimate.
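
A short simulation sketch of this point (with assumed numbers: $n = 100$ normal observations with mean $0.78$): the uncentred bootstrap statistic $\bar x^{*2}/(s^{*2}/n)$ inherits a non-centrality of roughly $n\bar x^2/s^2$, while subtracting $\bar x$ restores an approximately central $\chi^2_1$:

```python
# Centred vs. uncentred bootstrap chi-squared statistic (assumed setup).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.78, scale=1.0, size=100)
n, xbar = x.size, x.mean()

B = 5000
t_uncentred, t_centred = np.empty(B), np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=n, replace=True)
    xbar_s, var_s = xs.mean(), xs.var(ddof=1)
    t_uncentred[b] = xbar_s**2 / (var_s / n)          # built-in non-centrality
    t_centred[b] = (xbar_s - xbar)**2 / (var_s / n)   # approximately chi2(1)

print(f"mean uncentred stat: {t_uncentred.mean():.1f}  "
      f"(non-centrality n*xbar^2/s^2 = {n * xbar**2 / x.var(ddof=1):.1f})")
print(f"mean centred stat:   {t_centred.mean():.2f}  (chi2(1) mean = 1)")
```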

The $\chi^2$ tests are unavoidable in multivariate contexts, ranging from the Pearson $\chi^2$ for contingency tables to the Bollen-Stine bootstrap of the test statistic in structural equation models. The concept of shifting the distribution is extremely difficult to define well in these situations... although in the case of tests on multivariate covariance matrices, this is doable by an appropriate rotation.
