Bonferroni Correction: Assessing Its Suitability for Specific Statistical Tests

bonferroni · hypothesis-testing

Here's my experiment: I'm comparing the responses of two cultural groups (high/low context, total n = 550) on several dependent variables (instruments).
The idea is to see if these two groups differ in their responses on these variables.

I have a total of 10 different dependent variables, and I perform a Wilcoxon rank-sum (Mann-Whitney) test on each of them.

Here's what I do not understand: is a Bonferroni correction needed in this case? Does 'multiple statistical tests' refer only to testing multiple (>2) groups on one dependent variable (for example, 4 groups require 6 pairwise comparisons), or does it also cover the case of a single pairwise test on each of multiple dependent variables? How is the 'family' of tests defined?

Best Answer

So, I think a good way of approaching this problem is to think about it in the context of some more familiar statistical tests. Let's start by thinking our way through a simpler problem where the answer is more easily seen, and then we'll generalize to your case.

Let's say that, instead of 10 dependent variables and 2 groups, you have just a single dependent variable but 10 groups (e.g., 10 different cultures measured on a single instrument). In this case, you might consider running a series of pairwise t-tests comparing Group A vs Group B, Group A vs Group C, Group B vs Group C, and so on. The issue with this approach is that you would need to adjust your p-values because you have multiple tests. Treating each of these tests as independent of the others (a simplification, since comparisons share groups), each one has a probability of 0.05 (or whatever alpha value you've specified) of incorrectly rejecting the null hypothesis. To perform full pairwise comparisons among all 10 groups, we would need to run 45 different t-tests, each independently carrying a 0.05 probability of incorrectly rejecting the null hypothesis. Since we have multiple tests that each have a 0.05 probability of a Type I error, we end up with an overall Type I error rate much larger than 0.05 unless we do something to control for the fact that we ran so many tests to reach our final conclusions.
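To make the arithmetic concrete, here is a short sketch of how 45 tests at alpha = 0.05 inflate the familywise error rate, under the simplifying assumption that the tests are independent:

```python
from math import comb

alpha = 0.05
groups = 10
n_tests = comb(groups, 2)  # number of pairwise comparisons: 45

# If the 45 tests were independent, the probability of at least one
# false positive across all of them is 1 minus the probability that
# every single test correctly retains the null.
familywise_error = 1 - (1 - alpha) ** n_tests
print(n_tests, round(familywise_error, 3))  # the rate ends up near 0.9
```

So with no correction, the chance of at least one spurious "significant" result is roughly 90%, not 5%.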

There are two dominant approaches to addressing this. One is to adjust your p-values for multiple comparisons, as you propose. In its simplest form, this is the Bonferroni correction, where the desired Type I error rate (i.e., alpha value) is divided by the total number of tests you ran. There are other adjustments that may be preferable when you are concerned about preserving power or when there are correlations/dependence between the tests you run. The other approach is to use statistics that allow for a single test across all the groups. Using the example of a single dependent variable measured over 10 different groups, we might prefer an ANOVA to a series of t-tests. The advantage of the ANOVA is that it is a single test that tells us whether any of the group means differ, so we don't have to adjust this p-value since we just have the one test. We can then follow up the ANOVA with special post-hoc analyses that also help to control familywise error.
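A minimal sketch of both approaches on simulated data (the group sizes and distributions here are made up for illustration, not taken from the question):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Hypothetical data: one instrument measured in 10 groups of 55
groups = [rng.normal(0.0, 1.0, 55) for _ in range(10)]

# Approach 1: Bonferroni -- run each of the 45 pairwise tests, but
# declare significance only at alpha divided by the number of tests.
alpha = 0.05
per_test_alpha = alpha / 45  # roughly 0.0011

# Approach 2: a single omnibus test (one-way ANOVA) over all 10 groups;
# one p-value, so no multiplicity adjustment is needed for it.
stat, p = f_oneway(*groups)
print(per_test_alpha, p)
```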

So, in your case, you have multiple dependent variables. Multivariate statistics, which are often not taught in intro statistics classes, are the methods designed for analyzing multiple dependent variables at once. There are multivariate extensions of most of the common statistical tests: multivariate linear regression, Hotelling's T-squared, MANOVA, etc. The fact that you have used a non-parametric statistic suggests there is another wrench thrown into your analyses, as non-parametric methods are also commonly overlooked in general stats education. Without knowing more about what statistical software you're using, I'm hesitant to recommend much more about how to extend the multivariate statistical methods to non-parametric statistics (or potentially bootstrapping them).
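As one illustration of the multivariate route, here is a from-scratch sketch of the two-sample Hotelling's T-squared test in NumPy; the function name and the simulated data are my own for illustration, not from the question:

```python
import numpy as np

def hotelling_t2(X, Y):
    """Two-sample Hotelling's T-squared statistic and its F approximation.

    X, Y: (n_i, p) arrays of observations for the two groups.
    Returns (T2, F, df1, df2), where F ~ F(df1, df2) under the null.
    """
    nx, p = X.shape
    ny = Y.shape[0]
    diff = X.mean(axis=0) - Y.mean(axis=0)
    # Pooled within-group covariance matrix
    Sp = ((nx - 1) * np.cov(X, rowvar=False)
          + (ny - 1) * np.cov(Y, rowvar=False)) / (nx + ny - 2)
    T2 = (nx * ny) / (nx + ny) * diff @ np.linalg.solve(Sp, diff)
    F = (nx + ny - p - 1) / (p * (nx + ny - 2)) * T2
    return T2, F, p, nx + ny - p - 1

# Illustrative data: two groups of 40, three correlated-ish outcomes
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (40, 3))
Y = rng.normal(0.5, 1.0, (40, 3))
T2, F, df1, df2 = hotelling_t2(X, Y)
```

The point is that one T-squared test replaces three separate univariate tests, so no multiplicity correction is needed for the omnibus decision.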

So, we now come to your specific problem, where you've got multiple dependent variables being tested against your two groups. What counts as part of your familywise error really boils down to why you are using the statistic in the first place. If you have a specific hypothesis for each and every one of your dependent variables, then you can make a case for uncorrected p-values because you are using the result of each individual statistic to draw a conclusion. I don't think that case is very strong, but you can make it. If your hypothesis is very general (e.g., we think there might be some differences in some of these variables), then you absolutely have to do something to control that familywise error.

Getting right down to the core of the p-value issue and Type I error rates: anytime we run a test and make an inference from the p-value, we are accepting some probability of being wrong. When we run many tests in the course of a single study, we might be confident that the probability of any one of those tests producing a Type I error is $\alpha$ (0.05, or whatever our nominal value is), but the probability that at least one of the results in the study is a Type I error will be larger than $\alpha$. So, the "familywise error" is something of a matter of perspective, and it's part of the reason that emphasizing decision making based on p-values has been increasingly criticized by the statistics community. The most important thing is doing exactly what you've done, which is being critical of your decisions and trying to minimize the risk of introducing unwanted error into your reporting. Ultimately, you may decide that an elevated Type I error rate is OK for your particular need. This sometimes happens in exploratory research with small sample sizes (e.g., pilot studies), where controlling the Type I error rate would further reduce already low power.
At the end of the day, whatever you decide to do, it is important that you know the limitations and trade-offs of that decision so that you can appropriately inform your audience of those caveats; it should also be obvious to them which decision you made, and hopefully they're informed enough to appreciate its consequences.
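If you do decide to correct, translating this to the setup in the question looks roughly like the following sketch: 10 Wilcoxon rank-sum (Mann-Whitney) tests on simulated stand-in data, with a Bonferroni-adjusted threshold. All data, sizes, and variable names here are illustrative, not the questioner's:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
n_dvs, alpha = 10, 0.05
# Stand-in for the questioner's data: two groups, 10 instruments each
group_a = rng.normal(0.0, 1.0, size=(275, n_dvs))
group_b = rng.normal(0.2, 1.0, size=(275, n_dvs))

# One rank-sum test per dependent variable
p_values = [mannwhitneyu(group_a[:, j], group_b[:, j],
                         alternative="two-sided").pvalue
            for j in range(n_dvs)]

# Bonferroni: compare each p-value against alpha / (number of tests)
bonferroni_alpha = alpha / n_dvs  # 0.005
significant = [p < bonferroni_alpha for p in p_values]
```

Equivalently, you could multiply each p-value by 10 (capping at 1) and compare against 0.05; the decisions are identical.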
