ANOVA – Understanding Significant Interaction and Non-Significant Post-Hoc Results

anova, interaction

I am running an ANOVA (in R, using the afex package and the aov_car function) with two between-subject variables (Group, which has two levels, and Sex, which also has two levels) and one within-subject variable (SF). I am studying the latency of cerebral activity, in ms.
I found a significant (crossover) interaction between group and sex.
Descriptive statistics indicate that the females in group 1 have a faster activity than males in group 1 but females in group 2 have a slower activity than males in group 2.
However, post-hoc tests (using emmeans in R, with pairs and a Bonferroni correction) were non-significant.

My interpretation of the significant interaction is that the hypothesis FG1-MG1 = FG2-MG2 has to be rejected. However, because the differences are small and because of the crossover, each difference taken individually is not significant (non-significant post-hoc tests).

However, on a forum I read that this interpretation is wrong because, in an ANOVA, an interaction means that "at least one population mean is different from the other".

Could you please help me to determine how to correctly interpret the interaction in the ANOVA?

In other words, does the interaction mean that we test "the difference of differences", or do we test whether one mean is different from the others?

Cheers

Best Answer

I find it easiest to think about this in the form of a linear regression. Let $Y$ be the continuous outcome, $G$ represent the group (value of 0 for the reference group and 1 for the other), and $S$ represent sex (0 for male, 1 for female). Then your model is:

$$ Y = \beta_0 + \beta_G G + \beta_S S + \beta_{GS}GS$$

with intercept $\beta_0$ the estimated outcome when $G=0$ (reference group) and $S=0$ (male). A "significant" interaction means that you reject the null hypothesis of $\beta_{GS}=0$. That might be evaluated with a single F-test comparing 2 models, one with and one without the interaction term. Let's assume that the assumptions of the model are met and that this isn't a "spuriously significant interaction" as @BruceET warns.
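To connect this to the "difference of differences" wording in the question, it may help to write out the four cell means implied by the model. The notation $\mu_{gs}$ (mean for group $g$ and sex $s$) is introduced here only for illustration, and the error term is left out:

$$
\begin{aligned}
\mu_{00} &= \beta_0, & \mu_{01} &= \beta_0 + \beta_S, \\
\mu_{10} &= \beta_0 + \beta_G, & \mu_{11} &= \beta_0 + \beta_G + \beta_S + \beta_{GS}, \\
(\mu_{11} - \mu_{10}) - (\mu_{01} - \mu_{00}) &= (\beta_S + \beta_{GS}) - \beta_S = \beta_{GS}.
\end{aligned}
$$

Under this coding, $\beta_{GS}$ is the female-minus-male difference in one group minus the same difference in the reference group, so testing $\beta_{GS}=0$ is testing FG1-MG1 = FG2-MG2. It is not a test that some single cell mean differs from the others.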

Nevertheless, as he also points out, the single F-test of $\beta_{GS}=0$ isn't the same as the tests of all 6 pairwise differences among the 4 group/sex combinations that you evidently performed. In particular, the multiple-testing correction controls the chance of making any false-positive error across that whole set of comparisons, which makes each individual comparison harder to declare significant. Your Bonferroni correction is particularly (and unnecessarily) strict in that way: for 6 comparisons, each one needs p < 0.05/6 ≈ 0.0083 to establish "significance" while keeping the family-wise error rate at 0.05.
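A rough base-R sketch of that single F-test, under several assumptions: the data are simulated with made-up effect sizes, the within-subject factor SF is ignored, plain `lm` stands in for `afex::aov_car`, and the names `d`, `full`, `reduced`, `m` and `dd` are hypothetical. It also checks numerically that the fitted interaction coefficient is the difference of differences of the cell means.

```r
# Simulated latencies (ms) for a 2 x 2 between-subject design.
# All numbers here are invented for illustration only.
set.seed(1)
d <- expand.grid(id = 1:10, G = c(0, 1), S = c(0, 1))

# Assumed crossover: sex difference is -8 ms when G = 0 and +8 ms when G = 1.
d$Y <- 200 - 8 * d$S + 16 * d$G * d$S + rnorm(nrow(d), sd = 15)

full    <- lm(Y ~ G * S, data = d)  # includes the G:S interaction term
reduced <- lm(Y ~ G + S, data = d)  # additive model, no interaction

# Single F-test of beta_GS = 0: compare the two nested models.
print(anova(reduced, full))

# Difference of differences computed directly from the 2 x 2 cell means.
m  <- tapply(d$Y, list(G = d$G, S = d$S), mean)
dd <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# In this saturated model the interaction coefficient matches dd.
print(all.equal(unname(coef(full)["G:S"]), unname(dd)))
```

The model comparison uses one degree of freedom for the interaction, whereas the post-hoc step asks six separate questions about pairs of cells. That is one way to see why the two can disagree.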

It's not clear why you needed to do all pairwise comparisons. If you do need to evaluate all pairwise comparisons, then a more powerful test like the Tukey HSD, or at least the Holm modification of the Bonferroni correction, would be better. Even then, as @BruceET put it, "there is no guarantee that even real differences among population means detected by ANOVA will be resolved post hoc."
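To make the Bonferroni versus Holm point concrete, here is a small sketch using base R's `p.adjust`. The six unadjusted p-values are hypothetical, chosen only to show how the two corrections can differ. They are not taken from the question.

```r
# Hypothetical unadjusted p-values for the 6 pairwise comparisons.
p_raw <- c(0.004, 0.009, 0.030, 0.200, 0.450, 0.700)

alpha  <- 0.05
n_comp <- choose(4, 2)   # 4 group/sex cells give 6 pairwise comparisons
print(alpha / n_comp)    # Bonferroni per-comparison threshold

p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_holm <- p.adjust(p_raw, method = "holm")

# Holm-adjusted p-values are never larger than Bonferroni-adjusted ones.
print(data.frame(p_raw, p_bonf, p_holm))
```

With these made-up values, the second comparison is below 0.05 after the Holm adjustment but not after Bonferroni. In emmeans the adjustment is selected through the `adjust` argument of `pairs()`, for example `adjust = "holm"` or `adjust = "tukey"`.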

This type of thing usually happens when the interaction coefficient $\beta_{GS}$ is fairly small in magnitude even if "significant" by the usual p < 0.05 criterion. Then you need to apply your knowledge of the subject matter to interpret the results fairly. You might be able to say that the associations of group and sex with outcome aren't strictly additive, but you might not be able to specify just which group/sex combinations are "significantly" different from one another.