Solved – Non-parametric bootstrap p-values vs confidence intervals

bootstrap, confidence interval, p-value

Context

This is somewhat similar to this question, but I do not think it is an exact duplicate.

When you look for instructions on how to perform a bootstrap hypothesis test, it is usually stated that it is fine to use the empirical distribution for confidence intervals, but that you need to bootstrap from the distribution under the null hypothesis to obtain a correct p-value. As an example, see the accepted answer to this question. A general search on the internet mostly turns up similar answers.

The reason for not using a p-value based on the empirical distribution is that most of the time we do not have translation invariance.

Example

Let me give a short example. We have a coin and we want to perform a one-sided test of whether the frequency of heads is larger than 0.5.

We perform $n = 20$ trials and get $k = 14$ heads. The true p-value for this test would be $p = 0.058$.

On the other hand, if we bootstrap our 14 out of 20 heads, we effectively sample from the binomial distribution with $n = 20$ and $p = \frac{14}{20} = 0.7$. Shifting this distribution by subtracting 0.2, we get a barely significant result when testing our observed value of 0.7 against the resulting empirical distribution.

In this case the discrepancy is very small, but it gets larger when the success rate we test against gets close to 1.
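To make this concrete, here is a rough R sketch of the shifted-bootstrap calculation next to the exact binomial p-value (the object names and the number of resamples are my own illustration, not part of the original example):

# Sketch only: shifted-bootstrap p-value for the 14-out-of-20 example,
# compared with the exact binomial p-value of about 0.058.
set.seed(1)
x <- c(rep(1, 14), rep(0, 6))                 # observed sample, mean(x) = 0.7
B <- 10000
phat_star <- replicate(B, mean(sample(x, replace = TRUE)))
phat_null <- phat_star - (mean(x) - 0.5)      # shift the bootstrap distribution to centre 0.5
p_shifted <- mean(phat_null >= mean(x))       # one-sided bootstrap p-value, just under 0.05
p_exact   <- sum(dbinom(14:20, 20, 0.5))      # exact one-sided p-value, about 0.058
c(shifted = p_shifted, exact = p_exact)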

Question

Now let me come to the real point of my question: the very same defect also holds for confidence intervals. In fact, if a confidence interval has the stated confidence level $\alpha$, then the interval not containing the parameter value under the null hypothesis is equivalent to rejecting the null hypothesis at significance level $1 - \alpha$.
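To see this equivalence in the coin example, here is a rough sketch (again my own illustration): the one-sided percentile lower confidence bound exceeds 0.5 exactly when the fraction of bootstrap estimates at or below 0.5 falls under the significance level, so that fraction acts as a p-value.

# Sketch only: the p-value implied by inverting the one-sided percentile CI.
# The (1 - alpha) lower confidence bound is the alpha quantile of phat_star,
# and it exceeds 0.5 exactly when mean(phat_star <= 0.5) < alpha.
set.seed(1)
x <- c(rep(1, 14), rep(0, 6))
phat_star <- replicate(10000, mean(sample(x, replace = TRUE)))
p_percentile <- mean(phat_star <= 0.5)        # p-value implied by the percentile CI
p_percentile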

Why is it that confidence intervals based on the empirical distribution are widely accepted, while p-values based on it are not?

Is there a deeper reason or are people just not as conservative with confidence intervals?

In this answer Peter Dalgaard seems to agree with my argument. He says:

There's nothing particularly wrong about this line of reasoning, or at least not (much) worse than the calculation of CI.

Where does the "(much)" come from? It implies that generating p-values that way is slightly worse, but he does not elaborate on the point.

Final thoughts

Also, in An Introduction to the Bootstrap, Efron and Tibshirani dedicate a lot of space to confidence intervals but hardly any to p-values unless they are generated under a proper null hypothesis distribution, with the exception of one throwaway line about the general equivalence of confidence intervals and p-values in the chapter on permutation testing.

Let us also come back to the first question I linked. I agree with the answer by Michael Chernick, but he, too, argues that both confidence intervals and p-values based on the empirical bootstrap distribution are equally unreliable in some scenarios. That still does not explain why you find many people telling you that the intervals are fine but the p-values are not.

Best Answer

As @MichaelChernick said in response to a comment on his answer to a linked question:

There is a 1-1 correspondence in general between confidence intervals and hypothesis tests. For example a 95% confidence interval for a model parameter represents the non-rejection region for the corresponding 5% level hypothesis test regarding the value of that parameter. There is no requirement about the shape of the population distributions. Obviously if it applies to confidence intervals in general it will apply to bootstrap confidence intervals.

So this answer will address two associated issues: (1) why presentations of bootstrap results might more frequently specify confidence intervals (CI) rather than p-values, as suggested in the question, and (2) when both p-values and CI determined by bootstrap might be suspected to be unreliable, thus requiring an alternate approach.

I don't know of data that specifically support the claim in this question on the first issue. Perhaps in practice many bootstrap-derived point estimates are (or at least seem to be) so far from test decision boundaries that there is little interest in the p-value of the corresponding null hypothesis, with primary interest in the point estimate itself and in some reasonable measure of the magnitude of its likely variability.

With respect to the second issue, many practical applications involve "symmetrical distribution of test statistic, pivotal test statistic, CLT applying, no or few nuisance parameters etc" (as in a comment by @XavierBourretSicotte above), for which there is little difficulty. The question then becomes how to detect potential deviations from these conditions and how to deal with them when they arise.

These potential deviations from ideal behavior have been appreciated for decades, with several bootstrap CI approaches developed early on to deal with them. The Studentized bootstrap helps provide a pivotal statistic, and the BCa method corrects for both bias and skewness to obtain more reliable CI from bootstraps. A variance-stabilizing transformation of the data before determining bootstrapped CI, followed by back-transformation to the original scale, can also help.
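As a rough illustration of the last point for a proportion (this sketch and the choice of the arcsine transform are my own addition, not something prescribed by the cited sources for this example):

# Sketch only: variance-stabilising arcsine transform for a proportion,
# percentile CI computed on the transformed scale, then back-transformed.
set.seed(1)
x <- c(rep(1, 14), rep(0, 6))
g     <- function(p) asin(sqrt(p))            # variance-stabilising transform
g_inv <- function(y) sin(y)^2                 # its inverse
t_star <- replicate(9999, g(mean(sample(x, replace = TRUE))))
ci_transformed <- quantile(t_star, c(0.025, 0.975))
g_inv(ci_transformed)                         # CI back on the proportion scale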

The example in this question on sampling 14 heads out of 20 tosses of a fair coin is nicely handled by using CI from the BCa method; in R:

> library(boot)                               # provides boot() and boot.ci()
> dat14 <- c(rep(1,14),rep(0,6))              # 14 heads, 6 tails
> datbf <- function(data,index){d <- data[index]; sum(d)}  # statistic: number of heads
> set.seed(1)
> dat14boot <- boot(dat14,datbf,R=999)
> boot.ci(dat14boot)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates

CALL : 
boot.ci(boot.out = dat14boot)

Intervals : 
Level      Normal              Basic         
95%     (9.82, 18.22 )   (10.00, 18.00 )  

Level     Percentile            BCa          
95%       (10, 18 )         ( 8, 17 )  
Calculations and Intervals on Original Scale

The other CI estimates pose the noted problem of having their lower ends very close to, or right at, the population value of 10 heads per 20 tosses. The BCa CI accounts for the skewness introduced by binomial sampling away from even odds, so it nicely includes the population value of 10.

But you have to be looking for such deviations from ideal behavior before you can take advantage of these solutions. As in so much of statistical practice, actually looking at the data rather than just plugging into an algorithm can be key. For example, this question about CI for a biased bootstrap result shows results for the first 3 CI types shown in the code above but excludes the BCa CI. When I tried to reproduce the analysis shown in that question and include the BCa CI, I got the result:

> boot.ci(boot(xi,H.boot,R=1000))
Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o,  : 
estimated adjustment 'w' is infinite

where 'w' is involved in the bias correction. The statistic being examined has a fixed maximum value and the plug-in estimate that was bootstrapped was also inherently biased. Getting a result like that should indicate that the usual assumptions underlying bootstrapped CI are being violated.

Analyzing a pivotal quantity avoids such problems; even though an empirical distribution can't have useful strictly pivotal statistics, coming as close as reasonably possible is an important goal. The last few paragraphs of this answer provide links to further aids, like pivot plots for estimating via bootstrap whether a statistic (potentially after some data transformation) is close to pivotal, and the computationally expensive but potentially decisive double bootstrap.
