# Hypothesis Testing – Why Complementary p-values Do Not Cancel in Combining Methods

combining-p-values · hypothesis-testing · meta-analysis

Suppose I have performed two statistical tests (with a continuous distribution of $$p$$ values) for the same one-sided research hypothesis on different datasets, yielding the $$p$$ values $$p_1=x$$ and $$p_2=1-x$$.
I have no further a priori knowledge that distinguishes the datasets, e.g., that would justify weighting them.
In this case, I do not learn anything from my tests as to whether my research hypothesis is true or not.
If I tested the opposite research hypothesis, I would obtain the same results (just in different order).
Therefore the accurate result for combining these $$p$$ values is $$p_\text{comb} = \frac{1}{2}$$.
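To spell out the symmetry argument (a sketch in my own notation): write $$F(p_1, p_2)$$ for the combined $$p$$ value. Since the tests are continuous, the $$p$$ values for the opposite one-sided hypothesis are the complements, so consistency requires $$F(1-p_1, 1-p_2) = 1 - F(p_1, p_2)$$. With nothing distinguishing the datasets, $$F$$ should also be exchangeable: $$F(p_1, p_2) = F(p_2, p_1)$$. For $$p_1 = x$$ and $$p_2 = 1-x$$, these two requirements give $$F(x, 1-x) = F(1-x, x) = 1 - F(x, 1-x)$$ and hence $$F(x, 1-x) = \frac{1}{2}$$.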

However, many methods for combining $$p$$ values do not treat $$p$$ values and their complements symmetrically and thus produce implausible results.
For example, for $$x=0.01$$:

• Fisher’s method: $$p_\text{comb} = 0.06$$
• Pearson’s method: $$p_\text{comb} = 0.94$$
• Tippett’s method: $$p_\text{comb} = 0.02$$
• Simes’ method: $$p_\text{comb} = 0.02$$

By contrast, Stouffer’s method, Mudholkar and George’s method, and Edgington’s method are symmetric in the above sense and produce $$p_\text{comb} = \frac{1}{2}$$.
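The numbers above can be reproduced with a short script. This is a sketch using only the standard library; the closed-form expressions for two $$p$$ values are standard, but the function names are mine, and Mudholkar and George’s logit method is omitted for brevity:

```python
import math
from statistics import NormalDist

def chi2_sf_4df(x):
    # Survival function of a chi-square distribution with 4 degrees of
    # freedom: P(X > x) = exp(-x/2) * (1 + x/2).
    return math.exp(-x / 2) * (1 + x / 2)

def fisher(p1, p2):
    # Fisher: -2 * sum of log p values, referred to chi-square with 2k df.
    stat = -2 * (math.log(p1) + math.log(p2))
    return chi2_sf_4df(stat)

def pearson(p1, p2):
    # Pearson: like Fisher, but on the complements (lower tail).
    stat = -2 * (math.log(1 - p1) + math.log(1 - p2))
    return 1 - chi2_sf_4df(stat)

def tippett(p1, p2):
    # Tippett: distribution of the minimum of k uniform p values.
    return 1 - (1 - min(p1, p2)) ** 2

def simes(p1, p2):
    # Simes: min over ordered p values of k * p_(i) / i.
    lo, hi = sorted((p1, p2))
    return min(2 * lo, hi)

def stouffer(p1, p2):
    # Stouffer: average the corresponding z scores.
    nd = NormalDist()
    z = (nd.inv_cdf(1 - p1) + nd.inv_cdf(1 - p2)) / math.sqrt(2)
    return 1 - nd.cdf(z)

def edgington(p1, p2):
    # Edgington: distribution of the sum of k uniform p values
    # (for k = 2: P(p1 + p2 <= s) = s^2/2 when s <= 1).
    s = p1 + p2
    return s ** 2 / 2 if s <= 1 else 1 - (2 - s) ** 2 / 2

x = 0.01
for name, f in [("Fisher", fisher), ("Pearson", pearson),
                ("Tippett", tippett), ("Simes", simes),
                ("Stouffer", stouffer), ("Edgington", edgington)]:
    print(f"{name}: {f(x, 1 - x):.2f}")
```

Running this with $$x = 0.01$$ reproduces the asymmetric results (0.06, 0.94, 0.02, 0.02) for the first four methods and $$\frac{1}{2}$$ for Stouffer’s and Edgington’s methods.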

Obviously, this problem extends beyond the above simple example and can produce clear false positives. I could probably construct datasets where two opposing one-sided research hypotheses both come out highly significant.
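For instance (a minimal sketch with made-up $$p$$ values, not actual datasets): because Tippett’s method only looks at the smallest $$p$$ value, the pair $$(0.001, 0.999)$$ and its complement yield the same combined value, so the hypothesis and its opposite both appear highly significant:

```python
def tippett(pvals):
    # Tippett's method: p_comb = 1 - (1 - min p)^k
    return 1 - (1 - min(pvals)) ** len(pvals)

p_hypothesis = [0.001, 0.999]               # p values for the research hypothesis
p_opposite = [1 - p for p in p_hypothesis]  # p values for the opposite hypothesis

print(tippett(p_hypothesis))  # ≈ 0.002
print(tippett(p_opposite))    # ≈ 0.002 -- both directions look highly significant
```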

### My question

I consider it a serious flaw of a combining method if complementary $$p$$ values do not cancel each other.
However, I fail to find this issue mentioned in the literature on combining $$p$$ values.
To give just one example, the paper “Choosing Between Methods of Combining $$p$$-values” does not mention it as far as I can tell.
In fact, the only mention I have found so far is here on this site.

So, what am I missing?

• Is there literature on this (and I failed to find it)?
• Is my argument somehow flawed and I am overestimating the importance of this?
• Is this generally accepted, but just not documented?

As you suggest, plotting the $$p$$ values should really be obligatory: it then becomes clear that there are extreme values in both directions, which should lead the investigator to question their theory. I am not aware of substantial sources recommending this, though.