Solved – the correct terminology for one/two-tailed p-values and how to apply the Holm-Bonferroni correction

bonferroni, multiple-comparisons, p-value, terminology

I have been struggling to reconcile my (very basic) understanding of P-values with the approach of one of my colleagues and it appears to come down to interpretation of p-value terminology. This is having a knock-on effect on the correct implementation of a Holm-Bonferroni correction – any advice would be welcome!

The problem is this:

We have run a statistical test to look at whether the number of observations is higher or lower than expected, so it is a two-tailed test (<2.5% or >97.5%). The number of observations was converted to a z-score (<-1.96 or >1.96) and then into a probability (P). For our reporting we had two options:

1. Report the P-value and reject if P<2.5% or P>97.5%
2. Convert the P-value into a one-tailed value and reject if P<5%

The issue arises because my colleague believes the following:

Option 2 is the more widely used definition. When we see "two-tailed p-value" referred to, this is usually what is meant. And when we apply Holm-Bonferroni, it is expecting rejection to correspond to p-value below 5%, so we need this form of the p-value.

and

What we report as a p-value under option 1 is what would universally be reported as the p-value for a one-tailed test, so can be referred to as a "one-tailed" p-value. The p-value under option 2 is what would generally be referred to as a "two-tailed" p-value, i.e. a p-value which has been converted to reflect the fact that the test is being run as a two-tailed test.

This is something I am struggling with – is this actually true because it seems completely counter-intuitive to me? I can find nothing on the internet to suggest that this is the case either.

This interpretation is critical because we go on to perform a Holm-Bonferroni correction (there are > 50 tests in total), which requires (what I understand to be) one-sided style P-values (P<5%). I therefore believe that our 'two-sided' p-value from option 1 should be divided by 2 (0.5*P or 0.5*(1-P)) to convert it appropriately before running the correction. However, my colleague believes it should be multiplied by 2 (2*P or 2*(1-P)) because Option 1 is actually known as a 'one-sided' value…

Can anyone offer any guidance on this? What I thought was a relatively simple concept is now confusing me greatly!

Best Answer

Most of what you've said is correct, but I think you might be confusing yourself unnecessarily by having different definitions of a one and two tailed test. Just to briefly review:

A one-tailed $P$-value is the area of your reference distribution ($t$, $Z$, $F$, etc.) that lies either

above your observed test statistic, or
below your observed test statistic.

Thus if you are expecting a negative effect (using the $t$ distribution as an example), your $P$-value would be equal to

$$ P(t \leq t_{observed}) $$

or for a positive effect,

$$ P(t \geq t_{observed}) $$

both with the appropriate degrees of freedom.

A two-tailed $P$-value is the probability that the test statistic is at least as extreme as the one observed in EITHER direction, equal to

$$ P(t \geq | \, t_{observed} \, |) + P(t \leq -| \, t_{observed} \, |) = P( |t| \geq | \, t_{observed} \, |) $$

Because the $t$ distribution is symmetrical around 0, the two probabilities in the sum are equal. Therefore, the two-tailed $P$-value is twice the one-tailed $P$-value taken in the direction of the observed effect.

To convert your two-tailed $P$-value to a one-tailed $P$-value, you divide the former by 2 when the observed effect lies in the hypothesized direction, and take one minus that half otherwise:

$$ P_{\text{one-tailed}} = \begin{cases} \frac{1}{2} P_{\text{two-tailed}} & \text{if the observed effect is in the hypothesized direction} \\ 1-\frac{1}{2} P_{\text{two-tailed}} & \text{if it is in the opposite direction} \end{cases} $$
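As a rough illustration of these conversions in R, here is a minimal sketch. The z-scores are made-up values, and I am assuming that the P in the question is the lower-tail probability $P(Z \leq z)$ from a standard normal; neither assumption comes from the original data.

```r
# Hypothetical z-scores, chosen only for illustration.
z <- c(-2.3, 0.4, 2.3)

# Lower-tail probability P(Z <= z): the quantity that would be compared
# against 2.5% and 97.5% (assumed to be the "P" of option 1).
p_lower <- pnorm(z)

# One-tailed p-values for each possible direction of the alternative.
p_one_less    <- p_lower      # alternative: effect is negative
p_one_greater <- 1 - p_lower  # alternative: effect is positive

# Two-tailed p-value: twice the smaller tail, i.e. 2*P or 2*(1 - P).
p_two <- 2 * pmin(p_lower, 1 - p_lower)

# Equivalent form based on the absolute value of the statistic.
p_two_abs <- 2 * pnorm(-abs(z))

# Decision at the 5% level using the two-tailed p-value.
reject <- p_two < 0.05
```

Under these assumptions, rejecting when `p_two < 0.05` gives the same decisions as applying the rule P<2.5% or P>97.5% to `p_lower`, which in turn matches |z| > 1.96.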

Just to clarify, Holm-Bonferroni only requires that all $P$-values have the same form of null hypothesis. That is, either they are all two-sided or all one-sided. It doesn't make much sense to compare things with different nulls.
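To make the Holm-Bonferroni step concrete, here is a small sketch in R. The p-values are hypothetical and assumed to be all two-tailed; the manual version is only there to show the step-down logic, and `p.adjust` from base R's stats package does the same job.

```r
# Hypothetical two-tailed p-values, for illustration only.
p     <- c(0.001, 0.012, 0.021, 0.040, 0.300)
alpha <- 0.05
m     <- length(p)

# Manual Holm step-down: sort ascending, multiply the i-th smallest
# p-value by (m - i + 1), keep the adjusted values non-decreasing with a
# running maximum, and cap them at 1.
ord        <- order(p)
adj_sorted <- pmin(1, cummax((m - seq_len(m) + 1) * p[ord]))

# Put the adjusted values back in the original order.
p_holm_manual      <- numeric(m)
p_holm_manual[ord] <- adj_sorted

# Built-in equivalent.
p_holm <- p.adjust(p, method = "holm")

# Reject the hypotheses whose adjusted p-value is at or below alpha.
reject <- p_holm <= alpha
```

Whatever form of p-value goes in, the adjusted values are compared against the usual threshold (here 5%), so the inputs need to be p-values for which "small" means "evidence against the null".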

Related Solutions

Solved – How to apply multiple testing correction for gene list overlap using R

I don't know anything about gene expression studies but I do have some interest in multiple inference so I will risk an answer on this part of the question anyway.

Personally, I would not approach the problem in that way. I would adjust the error level in the original studies, compute the new overlap and leave the test at the end alone. If the number of differentially expressed genes (and any other result you are using) is already based on adjusted tests, I would argue that you don't need to do anything.

If you cannot go back to the original data and really do want to adjust the p-value, you can indeed multiply it by the number of tests but I don't see why it should have anything to do with the size of list2. It would make more sense to adjust for the total number of tests performed in both studies (i.e. two times the population). This is going to be brutal, though.

To adjust p-values in R, you can use p.adjust(p), where p is a vector of p-values.

p.adjust(p, method="bonferroni") # Bonferroni method, simple multiplication
p.adjust(p, method="holm") # Holm-Bonferroni method, more powerful than Bonferroni
p.adjust(p, method="BH") # Benjamini-Hochberg

As stated in the help file, there is no reason not to use Holm-Bonferroni over Bonferroni as it also provides strong control of the familywise error rate in any case but is more powerful. Benjamini-Hochberg controls the false discovery rate, which is a less stringent criterion.

Edited after the comment below:

The more I think about the problem, the more I think that a correction for multiple comparisons is unnecessary and inappropriate in this situation. This is where the notion of a “family” of hypotheses kicks in. Your last test isn't quite comparable to all the earlier tests: there is no risk of “capitalizing on chance” or cherry-picking significant results, there is only one test of interest, and it's legitimate to use the ordinary error level for this one.

Even if you correct aggressively for the many tests performed before, you would still not be directly addressing the main concern, which is the fact that some of the genes in both lists might have been spuriously detected as differentially expressed. The earlier test results still “stand” and if you want to interpret these results while controlling the family-wise error rate, you still need to correct all of them too.

But if the null hypothesis really is true for all genes, any significant result would be a false positive and you would not expect the same gene to be flagged again in the next sample. Overlap between both lists would therefore happen only by chance and this is exactly what the test based on the hypergeometric distribution is testing. So even if the lists of genes are complete junk, the result of that last test is safe. Intuitively, it seems that anything in-between (a mix of true and false hypotheses) should be fine too.
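For reference, an overlap test of this kind could be sketched in R roughly as follows. All counts are invented for illustration, and the variable names are my own; I am assuming the test asks how likely an overlap at least as large as the observed one would be if the second list were a random draw from the population.

```r
# Hypothetical counts, for illustration only.
N       <- 20000  # genes tested in both studies (the population)
n_list1 <- 500    # genes flagged in study 1
n_list2 <- 400    # genes flagged in study 2
overlap <- 30     # genes flagged in both studies

# P(overlap >= observed) when n_list2 genes are drawn at random from a
# population containing n_list1 "marked" genes.
p_overlap <- phyper(overlap - 1, n_list1, N - n_list1, n_list2,
                    lower.tail = FALSE)

# Same upper-tail probability obtained by summing the probability mass.
p_check <- sum(dhyper(overlap:min(n_list1, n_list2),
                      n_list1, N - n_list1, n_list2))
```

Note the `overlap - 1`: with `lower.tail = FALSE`, `phyper` returns the probability of strictly exceeding its first argument, so subtracting one includes the observed overlap itself.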

Maybe someone with more experience in this field might weigh in but I think an adjustment would only become necessary if you want to compare the total number of genes detected or find out which ones are differentially expressed, i.e. if you want to interpret the thousands of individual tests performed in each study.

Solved – Interpreting one- and two-tailed tests

You don't choose a one-tailed test based on near-significance in a two-tailed test.

You don't choose the direction of a one-tailed test based on directional information from the data.

Or at the least, if you do those things, you must also double the resulting p-value.

A one tailed test - if you do one at all - must be based on prior considerations, in place before you know what is in the data. If this is not the case, the significance levels (and p-values) are meaningless.
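A quick simulation can show why the doubling is needed. This is only a sketch under assumed conditions (standard normal test statistics, a true null, an arbitrary seed), not something taken from the question.

```r
set.seed(1)         # arbitrary seed, only for reproducibility
z <- rnorm(100000)  # test statistics simulated under a true null

# Post hoc choice: always test in whichever direction the data point to,
# which amounts to taking the smaller of the two tail areas.
p_posthoc <- pmin(pnorm(z), 1 - pnorm(z))

# Share of true nulls rejected at a nominal 5% level.
rate_posthoc <- mean(p_posthoc < 0.05)      # direction picked from the data
rate_doubled <- mean(2 * p_posthoc < 0.05)  # same p-values, doubled
```

With the direction picked from the data, the rejection rate should come out near 10% rather than the nominal 5%; doubling the p-value brings it back to about 5%.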
