Solved – p-value correction for multiple t-tests

Tags: adjustment, bonferroni, multiple-comparisons, p-value, t-test

My dataset consists of $n$ genes, each described by a vector of expression values: $5$ from "healthy" individuals and $5$ from "unhealthy" individuals.

I am going to run $n$ t-tests (one for each gene) to identify which genes show a different behaviour between the healthy and unhealthy populations.

Should I apply a correction (such as Bonferroni, Holm, Benjamini & Hochberg…) to the $n$ p-values?

EDIT:

I am wondering whether my case is a multiple comparisons problem or not.

Actually, I do not compare the genes with one another; I only compare the values of two different populations (healthy vs. unhealthy) for each gene. Therefore, I do not see where the multiple comparisons arise.

In other words, I am interested in finding those genes that behave differently between healthy samples and unhealthy samples. I am not interested in finding whether or not two genes behave the same.

Obviously, running $n$ t-tests I get many more p-values lower than $0.05$ than after applying the correction.

Best Answer

You absolutely do want to apply a correction. The key issue is significance arising by chance: as you increase the number of comparisons, you increase the number of results that will appear significant purely by chance.

For example, take the generic case of doing 100 comparisons using a significance threshold of 0.05. A p-value of 0.05 means there is a 5% chance of obtaining a result at least that extreme when the null hypothesis is true. Therefore, if you run these 100 comparisons and every null hypothesis is true, you would still expect about 5 genes to come out significant just by random chance.
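This expectation is easy to check with a quick simulation (my own sketch, not part of the original answer): draw both "groups" from the same distribution, so every null hypothesis is true by construction, and count how many of 100 t-tests come out significant anyway.

```r
# 100 comparisons where the null is true by construction: both groups
# are drawn from the same N(0, 1) distribution, so every "significant"
# result is a false positive.
set.seed(1)
pvals <- replicate(100, t.test(rnorm(5), rnorm(5))$p.value)
sum(pvals < 0.05)  # expect around 5 spurious hits
```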

As such, to avoid these false positives (Type I errors), we 'correct' the p-values, thereby making the tests more conservative.

The choice of correction matters too. Bonferroni is a common correction, but if you have thousands of genes it will be so conservative that you are exceedingly unlikely to find anything significant. In that case, you may use an FDR (False Discovery Rate) correction such as Benjamini–Hochberg. There is no absolute answer, so you need to explore the possibilities, make the best choice, and of course report which correction you applied.
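In R, the corrections mentioned above are all available through the built-in `p.adjust` function; here is a small illustration using made-up p-values:

```r
# Five hypothetical raw p-values (made up for illustration)
p <- c(0.001, 0.01, 0.02, 0.04, 0.30)

p.adjust(p, method = "bonferroni")  # each p multiplied by 5, capped at 1:
                                    # 0.005 0.050 0.100 0.200 1.000
p.adjust(p, method = "holm")        # step-down variant, uniformly less conservative
p.adjust(p, method = "BH")          # Benjamini-Hochberg, controls the FDR
```

A convenient property of `p.adjust` is that it returns adjusted p-values, so you can keep comparing against your usual 0.05 threshold rather than adjusting the threshold itself.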

EDIT

Regarding your comments below, I thought an example might help demonstrate the concept.

Using R, I generate completely random expression values for 250 genes under two treatments (A and B):

set.seed(8)
df <- data.frame(expression=runif(1000), 
                 gene=rep(paste("gene", seq(250)), 4), 
                 treatment = rep(c("A","A","B","B"), each=250))

I then split the data by gene and run a t.test comparing the two treatment groups:

out <- do.call("rbind", 
    lapply(split(df, df$gene), function(x) t.test(expression~treatment, x)$p.value))

Now, given that this is completely random data, there shouldn't be any real differences, and yet when I count how many p-values fall below 0.05, there are 9 "significant" genes!

length(which(out < 0.05))
[1] 9
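For completeness (this continuation is my own sketch, reusing the simulated data from above): applying `p.adjust` to those same 250 raw p-values shows the correction at work; since the data are pure noise, the corrected counts should drop back toward zero.

```r
# Reproduce the random data and raw p-values from above
set.seed(8)
df <- data.frame(expression = runif(1000),
                 gene = rep(paste("gene", seq(250)), 4),
                 treatment = rep(c("A", "A", "B", "B"), each = 250))
out <- do.call("rbind",
    lapply(split(df, df$gene), function(x) t.test(expression ~ treatment, x)$p.value))

# Recount after correcting: Bonferroni multiplies each p-value by 250;
# BH controls the false discovery rate and is less harsh
length(which(p.adjust(out, method = "bonferroni") < 0.05))
length(which(p.adjust(out, method = "BH") < 0.05))
```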

Avoiding mistakes like these is the whole point of making these corrections. Hopefully this helps clarify things for you.