Solved – Wilcoxon test with multiple testing: which correction for p values

bonferroni, false-discovery-rate, wilcoxon-signed-rank

Please, I'm not very confident in statistics, and I'm trying to respond to a reviewer for a paper on the following issues:

In my experiment I observed 15 babies during a 10-minute test in which they were free to play with an experimental toy:

  • each baby was tested individually
  • age ranges from 8 months to 36 months
  • during the test I recorded the durations, in seconds, of 12 selected behaviours (e.g. how long does the baby smile? how long does the baby explore the toy? and so on…)

To see whether there were differences due to age, I split the group into two samples (threshold: 24 months) with N=7 and N=8.

I then ran a Wilcoxon rank-sum test to compare, for each behaviour, the durations in the two groups, obtaining 12 p-values, some of which are significant (lower than alpha = 0.05).

The reviewer says that I need to correct alpha with Bonferroni, as I'm performing multiple tests.

This makes alpha very low:

alpha corrected = 0.05/12 ≈ 0.004

with the consequence that all significant results disappear.

Now, after googling a bit, I found that Bonferroni is not a good method when there are more than 3 or 4 comparisons, as it is too conservative, and that the False Discovery Rate (FDR) is proposed instead.

Do you agree on this?

Best Answer

Suppose you have a collection of hypotheses $H_1, \dots, H_s$ that is under consideration.

While the Bonferroni correction controls the Family-wise Error Rate (FWER), its power to detect that a hypothesis $H_i$, $i=1,\dots,s$, is false is low, because each individual test is carried out at level $\alpha/s$, which is quite stringent. In other words, what you're observing is the result of testing at a much smaller level than the conventional $\alpha$.
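To make the effect of the $\alpha/s$ threshold concrete, here is a minimal sketch in plain Python. The function name `bonferroni_reject` and the list `example_p` are my own illustrative choices, and the 12 p-values are made up for the example; they are not the values from your study.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i whenever p_i <= alpha / s, with s the number of tests."""
    s = len(p_values)
    threshold = alpha / s
    return [p <= threshold for p in p_values]


# Hypothetical p-values for 12 tests (illustrative only, not real data).
example_p = [0.001, 0.0043, 0.0048, 0.015, 0.030, 0.045,
             0.060, 0.110, 0.250, 0.400, 0.620, 0.900]

bonf_threshold = 0.05 / len(example_p)   # 0.05 / 12, roughly 0.0042
bonf_decisions = bonferroni_reject(example_p)

print(round(bonf_threshold, 4))
print(sum(bonf_decisions), "rejection(s) out of", len(example_p))
```

With these made-up numbers, six p-values are below 0.05 but only the smallest one survives the Bonferroni threshold, which is the pattern you describe.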

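However, if you still want to use a Bonferroni-like procedure, the Holm procedure (a step-down procedure) also controls the FWER. Only the smallest p-value is compared against $\alpha/s$; the remaining ordered p-values are compared against the progressively larger levels $\alpha/(s-1), \alpha/(s-2), \dots, \alpha$, so Holm rejects at least as many hypotheses as Bonferroni.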
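As a rough illustration of how the Holm step-down rule works, here is a small sketch in plain Python. The function name `holm_reject` and the 12 p-values in `example_p` are hypothetical choices for the example, not values from the question.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the ordered p-values with
    alpha/s, alpha/(s-1), ..., alpha and stop at the first failure."""
    s = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(s), key=lambda i: p_values[i])
    reject = [False] * s
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (s - step):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values for 12 tests (illustrative only, not real data).
example_p = [0.001, 0.0043, 0.0048, 0.015, 0.030, 0.045,
             0.060, 0.110, 0.250, 0.400, 0.620, 0.900]

holm_decisions = holm_reject(example_p)
print(sum(holm_decisions), "rejection(s) out of", len(example_p))
```

For these made-up values, Holm rejects three hypotheses, whereas a plain Bonferroni threshold of 0.05/12 would reject only the first, and the FWER is controlled at 0.05 in both cases.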

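The False Discovery Rate (FDR) is a weaker criterion than the FWER. The proportion of false discoveries is at most 1 and is nonzero only when at least one true hypothesis has been rejected, so in general $FDR\le FWER$. Any procedure that controls the FWER therefore also controls the FDR, while a procedure that only controls the FDR can be more liberal (more rejections).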
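For a concrete FDR-controlling method, one common choice is the Benjamini-Hochberg step-up procedure; that choice is mine, since nothing above names a specific method. The sketch below is plain Python, the function name `bh_reject` and the p-values in `example_p` are hypothetical, and the usual guarantee assumes independent (or positively dependent) tests, which may or may not be reasonable for 12 behaviours measured on the same babies.

```python
def bh_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up: find the largest k such that
    p_(k) <= k * q / s and reject the k smallest p-values."""
    s = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(s), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / s:
            k_max = rank  # keep the largest rank that passes
    reject = [False] * s
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values for 12 tests (illustrative only, not real data).
example_p = [0.001, 0.0043, 0.0048, 0.015, 0.030, 0.045,
             0.060, 0.110, 0.250, 0.400, 0.620, 0.900]

bh_decisions = bh_reject(example_p)
print(sum(bh_decisions), "rejection(s) out of", len(example_p))
```

With these made-up values the procedure rejects four hypotheses at $q = 0.05$, more than a FWER-controlling rule would on the same list, but the guarantee is about the expected proportion of false discoveries rather than the chance of making any false discovery at all.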

A final word: I wouldn't go as far as saying that "Bonferroni is not a good method when there are more than 3 or 4 comparisons, as it is too conservative", as your search concluded. In fact, Lehmann and Romano (2005) state that "when the number of tests is in the tens or hundreds of thousands, control of the FWER at conventional levels becomes so stringent that individual departures from the hypothesis have little chance of being detected." That is a very different regime from your 12 tests.

I hope this helps.