Solved – t test p value vs randomization-inference p value: What can we learn from the comparison?

hypothesis-testing · p-value · permutation-test · statistical-significance

How can we interpret differences between t test p values and randomization-inference p values?

Let’s say we have a randomized experiment with a binary treatment, denoted $Z_i = 1$ if unit $i$ is assigned to treatment, and outcomes, denoted $Y_i$.

We want to test for a treatment effect.

We test both the sharp null hypothesis of no effect and the null hypothesis of no average effect.

Definition $H_{0,sharp}$: Sharp null hypothesis of no effect
The treatment effect is zero for all subjects. Formally, $Y_i(1) = Y_i(0)$ for all $i$.

Definition $H_{0,weak}$: Null hypothesis of no average effect (sometimes called the weak null hypothesis)
The average treatment effect is zero. Formally, $\mu_{Y(1)} = \mu_{Y(0)}$.

We test $H_{0,sharp}$ using randomization inference (RI) and we test $H_{0,weak}$ with a t test.

If we run these two tests and get different answers, what are useful ways to interpret differences between the t test p value and the RI p value?

Strictly speaking, the two procedures test different hypotheses, so they cannot be compared directly. But that answer is not very useful, and it will not satisfy non-specialists (people with a substantive rather than technical interest in your research) who want to understand why your results look different under RI than under a t test. Furthermore, the two tests are alternative approaches to the same substantive question: "was there a treatment effect?" We should have guidelines for thinking about different answers to the same substantive question.

A good answer would discuss the differences generally enough to cover both p-value differences that lead to different statistical conclusions (e.g., one test gives p < 0.05 and the other p > 0.05) and those that lead to the same conclusion from both tests (e.g., both give p < 0.05, or both give p > 0.05).


Notes on RI

For those unfamiliar with RI: The RI p value is calculated by, first, computing the distribution of the test statistic across all (or many) treatment assignments, which is called the null or randomization distribution. The RI p value is then the proportion of the randomization distribution that is at least as large as our observed test statistic. (More discussion here, particularly page 5.)

We can conduct RI by calculating the test statistic for all possible permuted treatment assignment vectors (yielding an exact RI p value) or for a large random sample of permuted treatment assignment vectors (yielding an approximate RI p value). As Gerber and Green (2012) write, "Whether one uses all possible randomizations or a large sample of them, the calculation of p values based on an inventory of possible randomizations is called randomization inference."
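To make the mechanics concrete, here is a minimal sketch of the Monte Carlo version in Python. The data, sample sizes, and the difference-in-means test statistic are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes: 6 treated units followed by 6 control units.
y = np.array([5.2, 4.8, 6.1, 5.9, 4.5, 5.5, 4.0, 4.2, 3.9, 4.4, 4.1, 4.6])
z = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

def diff_in_means(y, z):
    """Test statistic: treated mean minus control mean."""
    return y[z == 1].mean() - y[z == 0].mean()

stat_obs = diff_in_means(y, z)

# Randomization distribution: re-randomize the treatment labels many
# times and recompute the statistic. Under the sharp null, the outcomes
# are fixed regardless of assignment, so this is valid.
n_perm = 10_000
null_stats = np.array(
    [diff_in_means(y, rng.permutation(z)) for _ in range(n_perm)]
)

# Two-sided RI p value: proportion of the randomization distribution at
# least as extreme as the observed statistic (counting the observed
# assignment itself, which keeps the p value strictly positive).
p_ri = (1 + np.sum(np.abs(null_stats) >= np.abs(stat_obs))) / (1 + n_perm)
print(p_ri)
```

With only 12 units there are just $\binom{12}{6} = 924$ possible assignments, so one could also enumerate them all (via `itertools.combinations`) to get the exact RI p value instead of sampling.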

Best Answer

I think your statement of the randomization-inference null hypothesis is incorrect. Or at least, you're conflating two methods of testing hypotheses with two different hypotheses. The randomization test (a.k.a. the permutation test) considers the exact or approximate distribution of test statistics obtained when the treatment/control "labels" are randomly swapped between subjects. It can be used to test the weak null hypothesis of no average treatment effect by calculating the t statistic for each permuted dataset and evaluating the proportion of these that exceed the statistic obtained in the unpermuted dataset.
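A sketch of that idea: use the t statistic itself as the test statistic, and compare the classical t-test p value (which refers the statistic to a t distribution) with the RI p value (which refers the same statistic to its randomization distribution). The data here are simulated and purely illustrative; with skewed outcomes and small samples the two p values can drift apart.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical skewed outcomes: 10 treated (shifted up by 1.0), 10 control.
y = np.concatenate([rng.exponential(2.0, 10) + 1.0, rng.exponential(2.0, 10)])
z = np.concatenate([np.ones(10, dtype=int), np.zeros(10, dtype=int)])

def welch_t(y, z):
    """Welch t statistic for treated vs. control."""
    return stats.ttest_ind(y[z == 1], y[z == 0], equal_var=False).statistic

t_obs = welch_t(y, z)

# Classical t-test p value: compares t_obs to a t reference distribution.
p_t = stats.ttest_ind(y[z == 1], y[z == 0], equal_var=False).pvalue

# RI p value: compares the same statistic to its randomization
# distribution, obtained by permuting the treatment labels.
n_perm = 5_000
null_t = np.array([welch_t(y, rng.permutation(z)) for _ in range(n_perm)])
p_ri = (1 + np.sum(np.abs(null_t) >= np.abs(t_obs))) / (1 + n_perm)

print(p_t, p_ri)
```

When the normal-theory approximation behind the t test holds well, the two p values will be close; discrepancies point to features of the data (skew, outliers, small n) where the reference distributions diverge.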

In this working article they frame the hypothesis of treatment effect variation as one of homogeneity, with the average treatment effect treated as a nuisance parameter: basically, "I don't care whether this drug works; I just want to know whether it works differently in some people than in others." The first hypothesis, usually tested with an analysis of a parallel-group design, concerns the average treatment effect (ATE); the second has been called treatment effect variation (TEV). Testing for TEV smells of a test of effect modification in the absence of a known effect modifier, and resembles subgroup analysis. Using randomization tests for TEV is a novel and interesting method, and it is worth reading the article in depth to understand exactly how they formulated such a test.

To summarize how the two hypotheses might agree or disagree in a $2 \times 2$ table:

Case 1: ATE, no TEV: the drug works, and the effect is the same for everybody. Solution: do not recommend if harmful; consider effect size before recommending approval/use.

Case 2: no ATE, no TEV: the drug does not work in anyone. Solution: conclude the drug is futile relative to the standard of care.

Case 3: no ATE, TEV: the drug works in individuals in such a contrived way that the harm to some and the benefit to others balance exactly. Solution: identify indicators/contraindicators of the harm/benefit subgroups and conduct a follow-up study if the predicted benefit is clinically significant.

Case 4: ATE, TEV: the drug shows some average effect, but the effect is not the same for everyone. Solution: identify harm groups, if any, and establish contraindications; predict the benefit in the remaining group and conduct a follow-up study if it is clinically significant.
