Solved – When combining p-values, why not just average them?

central-limit-theorem, combining-p-values, hypothesis-testing, multiple-comparisons, p-value

I recently learned about Fisher's method for combining p-values. It is based on the fact that a p-value under the null hypothesis follows a uniform distribution, and that $$-2\sum_{i=1}^n \log X_i \sim \chi^2(2n), \text{ given } X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \text{Unif}(0,1),$$
which I think is genius. But my question is: why go this convoluted way? What is wrong with just taking the mean of the p-values and applying the central limit theorem? Or the median? I am trying to understand the genius of R. A. Fisher behind this grand scheme.
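As a quick sanity check of that distributional fact, here is a simulation sketch (the choices $n = 5$ and $10^4$ replicates are arbitrary):

# -2 * sum(log(U_i)) over n i.i.d. Unif(0,1) draws should follow chi^2(2n)
set.seed(1)
n <- 5
stat <- replicate(1e4, -2 * sum(log(runif(n))))
ks.test(stat, pchisq, df = 2 * n)  # a large p-value is consistent with chi^2(2n)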

Best Answer

You can perfectly well use the mean $p$-value.

Fisher’s method sets a threshold $s_\alpha$ on $-2 \sum_{i=1}^n \log p_i$, such that if the null hypothesis $H_0$ (all $p$-values are i.i.d. $\sim U(0,1)$) holds, then $-2 \sum_i \log p_i$ exceeds $s_\alpha$ with probability $\alpha$. $H_0$ is rejected when this happens.

Usually one takes $\alpha = 0.05$, and $s_\alpha$ is the corresponding quantile of $\chi^2(2n)$. Equivalently, one can work with the product $\prod_i p_i$, which falls below $e^{-s_\alpha/2}$ with probability $\alpha$. Here is, for $n = 2$, a graph showing the rejection zone in red (using $s_\alpha = 9.49$); the rejection zone has area 0.05.

[Figure: rejection zone of Fisher’s method for $n = 2$]
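For concreteness, the threshold used in this graph is just a $\chi^2$ quantile; a minimal R sketch, for $\alpha = 0.05$ and $n = 2$:

alpha <- 0.05
n <- 2
s_alpha <- qchisq(1 - alpha, df = 2 * n)  # 9.4877..., the 9.49 used above
exp(-s_alpha / 2)                         # equivalent threshold on prod(p_i), about 0.0087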

Now you can choose to work with ${1\over n} \sum_{i=1}^n p_i$ instead, or equivalently with $\sum_i p_i$. You just need to find a threshold $t_\alpha$ such that $\sum_i p_i$ falls below $t_\alpha$ with probability $\alpha$. Exact computation of $t_\alpha$ is tedious in general; for $n$ big enough you can rely on the central limit theorem, and for $n = 2$, $t_\alpha = (2\alpha)^{1\over 2}$. The following graph shows the rejection zone (area 0.05 again).

[Figure: rejection zone of the mean $p$-value method for $n = 2$]
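Where does $t_\alpha = (2\alpha)^{1\over 2}$ come from? For $n = 2$ and $t \le 1$, the region $\{p_1 + p_2 < t\}$ is a triangle of area $t^2/2$; setting $t^2/2 = \alpha$ gives $t_\alpha = \sqrt{2\alpha}$. A quick Monte Carlo sketch confirming the level:

alpha <- 0.05
t_alpha <- sqrt(2 * alpha)          # sqrt(0.1), about 0.316
set.seed(1)
u1 <- runif(1e6); u2 <- runif(1e6)  # p-values under H0
mean(u1 + u2 < t_alpha)             # should be close to 0.05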

As you can imagine, many other shapes for the rejection zone are possible, and have been proposed. It is not a priori clear which is better, i.e. which has the greater power.

Let's assume that $p_1$ and $p_2$ come from two-sided (bilateral) $z$-tests with non-centrality parameter 1:

> # two-sided p-values of z-tests with z ~ N(1, 1): P(|Z| > |z|) = P(chi^2(1) > z^2)
> p1 <- pchisq( rnorm(1e4, 1, 1)**2, df=1, lower.tail=FALSE )
> p2 <- pchisq( rnorm(1e4, 1, 1)**2, df=1, lower.tail=FALSE )

Let's have a look at the scatterplot, with the points for which the null hypothesis is rejected shown in red.

[Figure: scatterplot of $(p_1, p_2)$ with rejected points in red]
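Here is a sketch of how such a plot can be drawn (colouring the points rejected by Fisher's method, using the threshold derived above; the original figure may have been produced differently):

rejected <- p1 * p2 < exp(-9.49 / 2)   # Fisher rejection region
plot(p1, p2, pch = 20, cex = 0.3,
     col = ifelse(rejected, "red", "black"),
     xlab = expression(p[1]), ylab = expression(p[2]))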

The power of Fisher’s product method is approximately

> sum(p1*p2<exp(-9.49/2))/1e4
[1] 0.2245

The power of the method based on the sum of $p$-values is approximately

> sum(p1+p2<sqrt(0.1))/1e4
[1] 0.1963
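(With $10^4$ simulated pairs, the Monte Carlo standard error of each power estimate is about $\sqrt{0.2 \times 0.8 / 10^4} \approx 0.004$, so the gap between 0.2245 and 0.1963 is well beyond simulation noise.)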

So Fisher’s method wins – at least in this case.