### What is this about?

Suppose I have performed two statistical tests (with a continuous distribution of $p$ values) for the same one-sided research hypothesis on different datasets, yielding the $p$ values $p_1=x$ and $p_2=1-x$.

I have no further a priori knowledge that distinguishes between the datasets, e.g., that would justify weighting one more than the other.

In this case, I do not learn anything from my tests as to whether my research hypothesis is true or not.

If I tested the opposite research hypothesis, I would obtain the same pair of $p$ values (just in a different order).

Therefore, the only result consistent with this symmetry when combining these $p$ values is $p_\text{comb} = \frac{1}{2}$.

However, many methods for combining $p$ values do not treat $p$ values and their complements symmetrically and thus produce implausible results.

For example, for $x=0.01$:

- Fisher’s method: $p_\text{comb} = 0.06$
- Pearson’s method: $p_\text{comb} = 0.94$
- Tippett’s method: $p_\text{comb} = 0.02$
- Simes’ method: $p_\text{comb} = 0.02$

By contrast, Stouffer's method, Mudholkar and George's method, and Edgington's method are symmetric in this sense and produce $p_\text{comb} = \frac{1}{2}$.
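These numbers are straightforward to reproduce. Below is a minimal base-R sketch using the textbook formula for each method (for Edgington's method I use the exact null CDF $S^k/k!$ of the sum $S$ of the $p$ values, valid for $S \le 1$):

```r
# Combined p-values for p1 = x, p2 = 1 - x with x = 0.01 (base R only).
p <- c(0.01, 0.99)
k <- length(p)

# Fisher: -2 * sum(log(p)) is chi-squared with 2k df under the null
fisher <- pchisq(-2 * sum(log(p)), df = 2 * k, lower.tail = FALSE)

# Pearson: mirror image of Fisher, based on the complements 1 - p
pearson <- pchisq(-2 * sum(log1p(-p)), df = 2 * k, lower.tail = TRUE)

# Tippett: smallest p-value with a Sidak-type correction
tippett <- 1 - (1 - min(p))^k

# Simes: minimum of k * p_(i) / i over the ordered p-values
simes <- min(sort(p) * k / seq_len(k))

# Stouffer: sum of probit-transformed p-values, renormalised
stouffer <- pnorm(sum(qnorm(p)) / sqrt(k))

# Edgington: exact null CDF of the sum S of the p-values, S^k / k! for S <= 1
edgington <- sum(p)^k / factorial(k)

# Mudholkar & George: sum of logits with the usual t-approximation
logit_t <- -sum(log(p / (1 - p))) / sqrt(k * pi^2 * (5 * k + 2) / (3 * (5 * k + 4)))
logit <- pt(logit_t, df = 5 * k + 4, lower.tail = FALSE)

round(c(fisher = fisher, pearson = pearson, tippett = tippett, simes = simes,
        stouffer = stouffer, edgington = edgington, logit = logit), 3)
#>   fisher  pearson  tippett    simes stouffer edgington    logit
#>    0.056    0.944    0.020    0.020    0.500     0.500    0.500
```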

Obviously, this problem extends beyond the simple example above and can lead to clear false positives. I could probably even produce datasets where two opposing one-sided research hypotheses are both highly significant (see the sketch below).
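To make that last point concrete, here is a sketch with Fisher's method: the $p$ values for the opposite hypothesis are the complements $1-p_1$ and $1-p_2$, i.e., the same set $\{x, 1-x\}$ in reverse order, so any order-invariant method assigns both hypotheses the same combined $p$ value, and for small enough $x$ both come out "highly significant".

```r
# Fisher's method applied to a hypothesis and its opposite (sketch).
fisher <- function(p) pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)

x <- 0.001
p_hyp <- c(x, 1 - x)  # p-values for the research hypothesis
p_opp <- 1 - p_hyp    # p-values for the opposite hypothesis: same set, reversed

fisher(p_hyp)  # ~0.0079
fisher(p_opp)  # ~0.0079 -- both "significant" at the 0.01 level
```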

### My question

I consider it a serious flaw of a combining method if complementary $p$ values do not cancel each other.

However, I fail to find this issue mentioned in the literature on combining $p$ values.

To give just one example, the paper *Choosing Between Methods of Combining $p$-values* does not mention it as far as I can tell.

In fact, the only mention I have found so far is here on this site.

So, **what am I missing?**

- Is there literature on this (and I failed to find it)?
- Is my argument somehow flawed and I am overestimating the importance of this?
- Is this generally accepted, but just not documented?

## Best Answer

I am not sure when the methods were first systematically compared, but Loughin, in a paper entitled "A systematic comparison of methods for combining p-values from independent tests", compared some of them. He also refers to previous work which discussed the issue you raise. Some other methods are compared in my R package metap, and the comparisons are set out in one of its vignettes. (Apologies for the self-promotion.)
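For illustration, a minimal sketch of your example with metap (assuming its current API, where each combiner returns a list with a `p` component):

```r
# Combining p1 = 0.01 and p2 = 0.99 with several metap combiners.
library(metap)

p <- c(0.01, 0.99)
sumlog(p)$p    # Fisher             ~0.056
minimump(p)$p  # Tippett            ~0.020
sumz(p)$p      # Stouffer            0.5
sump(p)$p      # Edgington           0.5
logitp(p)$p    # Mudholkar & George  0.5
```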

I am not an expert on modern genetics, but I believe that there the interest lies in which tests display the signal, so a method which does not cancel is quite acceptable and may even be preferred.

As you suggest, plotting the $p$-values should really be obligatory, since it then becomes clear that there are extreme values in both directions, which should lead the investigator to question their theory. I am not aware of substantial sources recommending this, though.
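For what it is worth, a minimal sketch of such a diagnostic (the $p$-values here are hypothetical): plotting the observed $p$-values against uniform quantiles makes extremes in both tails stand out.

```r
# Uniform QQ-plot of p-values as a pre-combination diagnostic (hypothetical data).
p <- c(0.001, 0.03, 0.2, 0.5, 0.8, 0.97, 0.999)
qqplot(ppoints(length(p)), p,
       xlab = "Uniform quantiles", ylab = "Observed p-values")
abline(0, 1, lty = 2)  # p-values compatible with the null should hug this line
```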