Hypothesis Testing – Why Complementary p-values Do Not Cancel in Combining Methods

combining-p-values · hypothesis-testing · meta-analysis

What is this about?

Suppose I have performed two statistical tests (each with a continuously distributed $p$ value) for the same one-sided research hypothesis on different datasets, yielding the $p$ values $p_1=x$ and $p_2=1-x$.
I have no further a priori knowledge that distinguishes the datasets, e.g., nothing that would justify weighting them differently.
In this case, I do not learn anything from my tests as to whether my research hypothesis is true or not.
If I tested the opposite research hypothesis, I would obtain the same results (just in different order).
Therefore, the only consistent result of combining these $p$ values is $p_\text{comb} = \frac{1}{2}$.
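To spell this symmetry argument out in my own notation (a sketch, not part of the original argument): testing the opposite one-sided hypothesis turns each $p_i$ into $1-p_i$, so a combining method $f$ that respects this duality and does not distinguish the datasets should satisfy

$$
f(p_1, p_2) = 1 - f(1-p_1,\, 1-p_2)
\qquad\text{and}\qquad
f(p_1, p_2) = f(p_2, p_1).
$$

Plugging in $p_1 = x$ and $p_2 = 1-x$ gives $f(x, 1-x) = 1 - f(1-x, x) = 1 - f(x, 1-x)$, which forces $p_\text{comb} = f(x, 1-x) = \frac{1}{2}$.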

However, many methods for combining $p$ values do not treat $p$ values and their complements symmetrically and produce implausible results.
For example, for $x=0.01$ (i.e., $p_1=0.01$ and $p_2=0.99$):

  • Fisher’s method: $p_\text{comb} = 0.06$
  • Pearson’s method: $p_\text{comb} = 0.94$
  • Tippett’s method: $p_\text{comb} = 0.02$
  • Simes’ method: $p_\text{comb} = 0.02$

By contrast, Stouffer's method, the method of Mudholkar and George, and Edgington's method are symmetric in this sense and produce $p_\text{comb} = \frac{1}{2}$. (The code sketch below reproduces all of these values.)
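
These numbers are easy to check numerically. The following sketch (my addition, not from the original post) uses `scipy.stats.combine_pvalues`, which implements Fisher's, Pearson's, Tippett's, Stouffer's, and Mudholkar and George's methods (the latter options assume a reasonably recent SciPy); Simes' and Edgington's methods are short enough to code by hand:

```python
import numpy as np
from scipy import stats

x = 0.01
p = np.array([x, 1 - x])  # p1 = 0.01, p2 = 0.99

# Methods built into SciPy:
for method in ["fisher", "pearson", "tippett", "stouffer", "mudholkar_george"]:
    _, p_comb = stats.combine_pvalues(p, method=method)
    print(f"{method:>16}: {p_comb:.4f}")

# Simes' method: minimum over the ordered p-values of n * p_(i) / i.
p_sorted = np.sort(p)
n = len(p)
simes = np.min(n * p_sorted / np.arange(1, n + 1))
print(f"{'simes':>16}: {simes:.4f}")

# Edgington's method for n = 2: the sum of two independent uniform
# p-values has the triangular (Irwin-Hall) distribution, so
# P(P1 + P2 <= s) = s^2 / 2 for 0 <= s <= 1.
s = p.sum()
edgington = s**2 / 2 if s <= 1 else 1 - (2 - s) ** 2 / 2
print(f"{'edgington':>16}: {edgington:.4f}")
```

This prints approximately 0.056, 0.944, 0.020, 0.5, 0.5, 0.020, and 0.5, matching the figures quoted above.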

Obviously, this problem extends beyond the simple example above and can produce clear false positives. I could probably construct datasets for which two opposing one-sided research hypotheses both come out highly significant; see the sketch below.
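
For instance (my sketch, working with the $p$ values directly rather than with raw datasets): a pair of tests that are extreme in opposite directions comes out "significant" under Fisher's method no matter which of the two opposing hypotheses is combined:

```python
import numpy as np
from scipy import stats

# One test extreme in each direction:
p = np.array([0.001, 0.999])

_, p_fwd = stats.combine_pvalues(p, method="fisher")      # original hypothesis
_, p_rev = stats.combine_pvalues(1 - p, method="fisher")  # opposite hypothesis

# Both are ~0.0079, i.e. "highly significant" in both directions,
# because {0.001, 0.999} is its own set of complements.
print(p_fwd, p_rev)
```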

My question

I consider it a serious flaw of a combining method if complementary $p$ values do not cancel each other.
However, I fail to find this issue mentioned in the literature on combining $p$ values.
To give just one example, the paper Choosing Between Methods of Combining $p$-values does not mention it as far as I can tell.
In fact, the only mention I have found so far is here on this site.

So, what am I missing?

  • Is there literature on this (and I failed to find it)?
  • Is my argument somehow flawed and I am overestimating the importance of this?
  • Is this generally accepted, but just not documented?

Best Answer

I am not sure when the methods were first systematically compared, but Loughin, in a paper entitled "A systematic comparison of methods for combining p-values from independent tests" (available here), compared some of them. He refers to previous work which discussed the issue you raise. Some other methods are compared in my R package metap, and the comparisons are set out in one of its vignettes. (Apologies for the self-promotion.)

I am not an expert on modern genetics, but I believe that there the interest is in which tests display a signal, so a method which does not cancel is quite acceptable and may even be preferred.

As you suggest, plotting the $p$-values should really be obligatory, since it then becomes clear that there are extreme values in both directions, which should lead the investigator to question their theory. I am not aware of substantial sources recommending this, though.
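
A minimal sketch of such a diagnostic plot (my addition; the data are made up to have extremes in both tails):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Mostly unremarkable p-values plus a few extremes in both directions:
p = np.concatenate([rng.uniform(size=20), [0.001, 0.002, 0.998, 0.999]])

plt.hist(p, bins=20, range=(0, 1), edgecolor="black")
plt.xlabel("p-value")
plt.ylabel("count")
plt.title("p-values piling up in both tails")
plt.show()
```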
