Solved – How to achieve a two-sided combined p-value using Fisher’s method

combining-p-values, hypothesis-testing, meta-analysis, multiple-comparisons, p-value

Let's say I want to answer the question of whether smokers and nonsmokers have different levels of Gene A. This seems like an obvious two-sided test. However, if I have multiple studies and want to combine their p-values using Fisher's method, I am confused about how to do this, since Fisher's method is one-sided.

For example, let's say I am using a Wilcoxon two-sample rank-sum test and obtain the following results from 4 studies:

Study 1: Smokers have higher Gene A, p = 0.02
Study 2: Smokers have higher Gene A, p = 0.04
Study 3: Smokers have lower Gene A, p = 0.02
Study 4: Smokers have lower Gene A, p = 0.04

Because these are two-sided probabilities, I could not simply use Fisher's method on these p-values (doing so would lose the directionality), so instead I could calculate new p-values using a one-sided Wilcoxon test.

Testing the hypothesis that Gene A is greater in smokers, the data may instead look like this:

Study 1: Testing if Smokers have higher Gene A, p = 0.01
Study 2: Testing if Smokers have higher Gene A, p = 0.02
Study 3: Testing if Smokers have higher Gene A, p = 0.99
Study 4: Testing if Smokers have higher Gene A, p = 0.99

If I use Fisher's method, I get Χ² = 17.07459, df = 8, p = 0.029. This is not the result I expected, as I would expect the p-values to "cancel" out to a large extent.
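To make the arithmetic concrete, here is a minimal base-R sketch of Fisher's method applied to the four one-sided p-values above. It uses the textbook statistic $-2\sum_i \ln p_i$, referred to a $\chi^2$ distribution with $2k$ degrees of freedom; the variable names are illustrative only.

```r
# One-sided ("smokers higher") p-values from the four studies above
p_greater <- c(0.01, 0.02, 0.99, 0.99)

# Fisher's method: X^2 = -2 * sum(log(p)) ~ chi-squared with 2k df under H0
fisher_stat <- -2 * sum(log(p_greater))
fisher_df   <- 2L * length(p_greater)
fisher_p    <- pchisq(fisher_stat, df = fisher_df, lower.tail = FALSE)

cat(sprintf("X^2 = %.5f, df = %d, p = %.3f\n", fisher_stat, fisher_df, fisher_p))
```

The two studies with p = 0.99 contribute almost nothing to the sum (log(0.99) is close to 0), which is why they cannot offset the two small p-values.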

Regardless of that, this requires me to choose a direction in advance when constructing my Wilcoxon test, when in reality I want a "two-sided" approach: I do not know whether Gene A will be higher or lower.

Is there a way to generate a two-sided combined p-value (i.e., one that I could construct from a set of 4 one-sided p-values in one direction and the 4 corresponding one-sided p-values in the other direction)?

Best Answer

As you have found out, Fisher's method does not let values in opposite directions cancel. The same is true of Tippett's method (which uses the minimum $p$). However, the good news is that Stouffer's method (which $z$-transforms the $p$) and Edgington's method (which sums the $p$) do cancel, so you could use one of them instead.
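A minimal base-R sketch of the two cancelling methods, applied to the question's one-sided p-values. These are textbook formulas written out by hand, not the metap package's code: Stouffer sums $z_i = \Phi^{-1}(1 - p_i)$ and divides by $\sqrt{k}$; Edgington refers $\sum_i p_i$ to the Irwin-Hall distribution of a sum of $k$ independent uniforms. The two-sided Stouffer value $2\Phi(-|Z|)$ is shown as one natural way to get a two-sided answer; treat it as an illustration rather than a recommendation.

```r
# ASSUMPTION: the four one-sided ("smokers higher") p-values from the question
p <- c(0.01, 0.02, 0.99, 0.99)
k <- length(p)

# Stouffer: z_i = qnorm(1 - p_i); Z = sum(z_i) / sqrt(k) is N(0, 1) under H0.
# Studies pointing in opposite directions give z-scores of opposite sign,
# so they cancel in the sum.
z <- qnorm(p, lower.tail = FALSE)
stouffer_z <- sum(z) / sqrt(k)
stouffer_p_one_sided <- pnorm(stouffer_z, lower.tail = FALSE)
# Two-sided version: is Z far from 0 in either direction?
stouffer_p_two_sided <- 2 * pnorm(-abs(stouffer_z))

# Edgington: under H0 the sum of k independent U(0, 1) p-values follows the
# Irwin-Hall distribution; the combined p-value is P(sum <= observed sum).
edgington_p <- function(p) {
  k <- length(p)
  s <- sum(p)
  j <- 0:floor(s)
  sum((-1)^j * choose(k, j) * (s - j)^k) / factorial(k)
}
edg <- edgington_p(p)

round(c(stouffer_one_sided = stouffer_p_one_sided,
        stouffer_two_sided = stouffer_p_two_sided,
        edgington = edg), 3)
```

With these inputs the opposing studies largely offset each other, so the one-sided Stouffer and Edgington values both come out around 0.5, in contrast to Fisher's 0.029.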

If you look at the vignette for the metap R package, you will find some worked examples. Disclaimer: I am the author of that package.

Edit to add comments on directionality

The null hypothesis $H_0$ is well defined: all $p_i$ have a uniform distribution on the unit interval. There are two classes of alternative hypothesis:

$H_A$: all $p_i$ have the same (unknown) non-uniform, non-increasing density,
$H_B$: at least one $p_i$ has an (unknown) non-uniform, non-increasing density.

So these are basically omnibus tests where there is no obvious directionality built in.

The lack of a natural alternative hypothesis may account for the number of methods available and their differing behaviour.

Given that there is no obvious directionality built in, what people do in a substantive application is look at the data: if they see $p$-values piling up near 0, they assume the effect is in that direction, and if near 1, in the contrary direction. I suppose there is nothing to stop people observing piling up at both ends and interpreting accordingly. Another course of action would be to use one of the methods, like Fisher's, that does not cancel, and then also apply it to the complements of the $p$-values ($1 - p_i$), where it would be sensitive to piling up at the other end. I am not aware of any settings in which these approaches have been applied, but that may just be me.
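A base-R sketch of the last suggestion: Fisher's method run once on the $p$-values and once on their complements. It assumes $1 - p_i$ is the one-sided $p$-value in the opposite direction, which is exact only for continuous test statistics (for discrete tests such as the Wilcoxon or Fisher's exact test, the two one-sided values overlap slightly).

```r
# ASSUMPTION: one-sided p-values in the "smokers higher" direction (from the question)
p <- c(0.01, 0.02, 0.99, 0.99)

# Fisher's method: -2 * sum(log(p)) ~ chi-squared with 2k df under H0
fisher_combined <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}

# Sensitive to p-values piling up near 0 (evidence for "higher")
p_high <- fisher_combined(p)
# Sensitive to p-values piling up near 1 (evidence for "lower");
# ASSUMPTION: 1 - p is the opposite one-sided p-value (exact only for
# continuous test statistics)
p_low <- fisher_combined(1 - p)

round(c(higher = p_high, lower = p_low), 3)
```

On the question's numbers both directions give small combined values (roughly 0.03 and 0.02), which is the "piling up at both ends" pattern rather than a single consistent effect.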

Related Solutions

Solved – One-sided Fisher’s exact test and its complement

The particular table helps a lot. Fisher's exact test assigns probabilities to tables with these particular marginals using the hypergeometric distribution. In this case, we're thinking of drawing 9 balls (the cases) from an urn with 2852 white balls (exposed) and 2861 black balls (not exposed). The number of white balls drawn is the count of exposed cases. The distribution is:

Exposed cases: 0      1      2      3      4      5      6      7      8      9
Probability:   0.002  0.018  0.071  0.165  0.247  0.246  0.163  0.070  0.017  0.002

The one-sided test in your output is giving the probability of 2 or fewer:

0.002 + 0.018 + 0.071 ≈ 0.0904 (up to rounding of the individual terms)

The one-sided test in the other direction would give the probability of 2 or more, which is 1 minus the probability of 0 or 1:

1 - (0.002 + 0.018) = 0.98

Note that the two-sided test is the probability of 0, 1, 2, 7, 8, or 9, which does come to 0.179.

So the p-values for the two one-tailed tests don't add to one, because they each include the particular observed value and the distribution is discrete.
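These numbers can be reproduced in base R with `dhyper()` and `phyper()`, using the urn described above (2852 white, 2861 black, 9 draws, 2 white observed). The two-sided value follows the rule documented for R's `fisher.test()`: add up the probabilities of all tables no more likely than the observed one. The small relative tolerance (`1e-7`) reflects my understanding of how `fisher.test()` guards against floating-point ties and should be treated as an assumption.

```r
# Urn from the answer: 2852 white (exposed), 2861 black (not exposed), 9 draws (cases)
white    <- 2852
black    <- 2861
draws    <- 9
observed <- 2

probs <- dhyper(0:draws, white, black, draws)   # P(X = 0), ..., P(X = 9)

p_lower <- phyper(observed, white, black, draws)                          # P(X <= 2)
p_upper <- phyper(observed - 1, white, black, draws, lower.tail = FALSE)  # P(X >= 2)

# Both one-sided p-values include P(X = 2), so they sum to 1 + P(X = 2)
overlap <- dhyper(observed, white, black, draws)

# Two-sided: total probability of outcomes no more likely than the observed one
# (ASSUMPTION: relative tolerance 1e-7, mirroring fisher.test's handling of ties)
p_two <- sum(probs[probs <= overlap * (1 + 1e-7)])

round(c(lower = p_lower, upper = p_upper, sum = p_lower + p_upper, two_sided = p_two), 4)
```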

Solved – Significant two sided Wilcoxon rank sum test: Which group has higher median

Since I know that the Wilcoxon test compares pseudo-medians

Not quite. The Wilcoxon signed rank test compares the one-sample Hodges-Lehmann statistic (the median of within-sample pairwise averages, equivalently the median of the Walsh averages, or pseudomedian) to 0. The rank-sum test, however, compares the two-sample Hodges-Lehmann statistic (the median of all between-sample pairwise differences) to zero; it does not compare two one-sample pseudomedians.
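To see the distinction concretely, here is a base-R sketch with two small hypothetical samples (made up purely for illustration). It computes each sample's pseudomedian and the two-sample Hodges-Lehmann shift; the shift is not simply the difference of the two pseudomedians.

```r
# Hypothetical toy samples (illustration only)
x <- c(1, 2, 3, 10)
y <- c(4, 5, 6)

# One-sample Hodges-Lehmann estimate (pseudomedian): median of the Walsh
# averages (v_i + v_j) / 2 over all pairs i <= j
pseudomedian <- function(v) {
  walsh <- outer(v, v, "+") / 2
  median(walsh[upper.tri(walsh, diag = TRUE)])
}

# Two-sample Hodges-Lehmann estimate: median of all differences x_i - y_j
hl_shift <- median(outer(x, y, "-"))

pm_x <- pseudomedian(x)   # 2.75
pm_y <- pseudomedian(y)   # 5
c(pm_x = pm_x, pm_y = pm_y, difference = pm_x - pm_y, hl_shift = hl_shift)
# hl_shift is -2.5, whereas pm_x - pm_y is -2.25
```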

Based on my descriptive graphs, i.e. boxplots, jitter plots, it is not immediately visible which of the two groups I am comparing has the higher/lower median/mean.

You seem to be assuming that the difference-in-mean and the difference-in-median will behave like the median-pairwise-difference (in the sense that if one is different the other two will be and in the same direction).

This will often be true but it is not necessarily the case.

A population (or indeed, a sample) can have any one of the three differ in a given direction while one or both of the others do not differ, or even differ in the opposite direction.
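A small hypothetical example (made-up numbers, base R) where the means point one way while the medians and the median pairwise difference point the other:

```r
# Hypothetical toy samples, chosen so the summaries disagree
x <- c(0, 0, 0, 10)
y <- c(1, 1, 1, 1)

mean_diff   <- mean(x) - mean(y)         # 1.5: by means, x is larger
median_diff <- median(x) - median(y)     # -1:  by medians, y is larger
hl_diff     <- median(outer(x, y, "-"))  # -1:  the rank-sum's location estimate also says y is larger

c(mean = mean_diff, median = median_diff, hodges_lehmann = hl_diff)
```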

I am not sure which way would be appropriate to answer the question which of the two groups of which I know are significantly different w.r.t. the outcome variable, is greater than the other.

If you're asking "which group did the rank sum test think had higher location?" (i.e. what difference caused the rejection?), then compute the median pairwise difference. This is a simple calculation and most decent stats packages will offer the calculation (at least as an option relating to a confidence interval for the difference) with the rank sum test.

If you're asking "in what direction do the medians differ?" or "in what direction do the means differ?", then you won't necessarily get an answer consistent with the rank sum test. If you care about one of those, test that instead (perhaps with a permutation test based on that particular statistic).

If you assume that the distributions are the same up to a possible location shift under the alternative, and you assume population means exist (generally quite a reasonable assumption), then you've already made the assumption needed to attribute the direction of the difference to the direction the rank sum test looked at.

Conduct a one-sided test and see which is significant.

You could, if you do it at half the significance level, but it would seem to be a fairly involved way to go about finding what can be obtained via a simple sample calculation. If you don't have a convenient way to do the calculation otherwise, it should work just fine.
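For completeness, here is what the one-sided route looks like in base R with hypothetical data. Each one-sided `wilcox.test()` is judged at `alpha / 2`; because the two one-sided p-values sum to at least 1, at most one of them can fall below `alpha / 2`. The data and the level `alpha = 0.05` are assumptions for illustration.

```r
# Hypothetical toy samples (no ties, so wilcox.test uses exact p-values)
x <- c(1.1, 2.3, 0.7, 1.9, 3.4)
y <- c(3.2, 4.1, 2.9, 3.7, 5.0)
alpha <- 0.05

p_two     <- wilcox.test(x, y)$p.value
p_less    <- wilcox.test(x, y, alternative = "less")$p.value     # is x shifted below y?
p_greater <- wilcox.test(x, y, alternative = "greater")$p.value  # is x shifted above y?

# Run each one-sided test at alpha / 2; at most one of them can reject
direction <- if (p_less < alpha / 2) "x tends to be smaller" else
             if (p_greater < alpha / 2) "x tends to be larger" else "no decision"

c(two_sided = p_two, less = p_less, greater = p_greater)
direction
```

With these numbers x lies mostly below y, so the "less" test rejects, matching the sign of the median pairwise difference.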

Look at the means/medians within the two groups and whichever is greater is the one which is significantly greater.

You could find mean and median both have group 1 greater but the rank sum rejected in the other direction. Or you could have mean greater in one direction and median greater in the other direction. Or you could have both group-means and both group-medians be equal even though the rank-sum test rejected.

So this would not be a generally good choice.

Here's an example in R. First, requesting a confidence interval also produces the location-difference estimate; then the same estimate is calculated directly from the sample. This assumes the data are already in x and y:

> wilcox.test(x,y,conf.int=TRUE,conf.level=0.9)

        Wilcoxon rank sum test

data:  x and y
W = 9, p-value = 0.05927
alternative hypothesis: true location shift is not equal to 0
90 percent confidence interval:
 -15.3239889  -0.5774458
sample estimates:
difference in location 
             -8.891949 

> median(outer(x,y,"-")) # calculate median of pairwise differences
[1] -8.891949

So this tells us that y tends to be larger than x (as measured by the rank-sum statistic), since the x - y differences tend to be negative.
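Because x and y are not shown above, here is a self-contained version of the same check with hypothetical data. The samples are small and have no ties, so `wilcox.test()` uses its exact method, in which the reported estimate is the median of the pairwise differences. With tied data or larger samples R computes the estimate differently, so the two values may not match exactly there.

```r
# Hypothetical x and y (the original data are not shown); no ties, small n
x <- c(1.1, 2.3, 0.7, 1.9, 3.4)
y <- c(3.2, 4.1, 2.9, 3.7, 5.0)

res    <- wilcox.test(x, y, conf.int = TRUE, conf.level = 0.9)
est    <- unname(res$estimate)         # "difference in location"
direct <- median(outer(x, y, "-"))     # median of all pairwise x - y differences

c(wilcox_estimate = est, direct = direct)
# A negative value means y tends to be larger than x
```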

An aside on this bit:

This lets me confidently reject the null hypothesis of no true location shift at a confidence level < 0.1

There are a few things wrong there. That doesn't really let us "confidently" do anything, and 0.1 would be your significance level, not a confidence level. If I wanted to speak with some sort of confidence about an effect, I'd tend to look at effect sizes and confidence intervals, and I'd want at least some (before-seeing-the-data) sense of the power against an anticipated or useful effect size.
