There's no use for a post hoc test here. What could you possibly find in a post hoc test that isn't already obvious from the ANOVA? There's a main effect of A, a main effect of B, and an interaction.
Some people do them to test something like whether A1B1 - A1B2 is significant while A2B1 - A2B2 is not. They find that pattern and report it as important, but it's meaningless: it tells you less than the interaction already told you. The contrast between a significant and a non-significant result is not itself a test of whether the difference between conditions depends on the level of the other factor. The interaction already was that test.
I find it easiest to think about this in the form of a linear regression. Let $Y$ be the continuous outcome, $G$ the group indicator (0 for the reference group, 1 for the other), and $S$ the sex indicator (0 for male, 1 for female). Then your model is:
$$ Y = \beta_0 + \beta_G G + \beta_S S + \beta_{GS}GS + \epsilon$$
with intercept $\beta_0$ the estimated outcome when $G=0$ (reference group) and $S=0$ (male), and $\epsilon$ the error term. A "significant" interaction means that you reject the null hypothesis $\beta_{GS}=0$. That can be evaluated with a single F-test comparing two models, one with and one without the interaction term. Let's assume that the assumptions of the model are met and that this isn't a "spuriously significant interaction" as @BruceET warns.
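A minimal sketch of that model-comparison F-test in Python with statsmodels; the simulated data frame and its coefficient values are stand-ins for your own data, not anything from your study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Simulated stand-in for your data: y continuous,
# G (0 = reference group, 1 = other), S (0 = male, 1 = female)
rng = np.random.default_rng(0)
n = 200
G = rng.integers(0, 2, n)
S = rng.integers(0, 2, n)
y = 1.0 + 0.5 * G + 0.3 * S + 0.4 * G * S + rng.normal(size=n)
df = pd.DataFrame({"y": y, "G": G, "S": S})

# Nested models: without and with the G:S interaction
additive = smf.ols("y ~ G + S", data=df).fit()
interaction = smf.ols("y ~ G + S + G:S", data=df).fit()

# Single F-test of beta_GS = 0, comparing the two models
print(anova_lm(additive, interaction))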
Nevertheless, as he also points out, the single F-test of $\beta_{GS}=0$ isn't the same as the set of all 6 pairwise comparisons among the 4 group/sex combinations that you evidently performed. In particular, the correction needed to protect against false positives across that whole set makes each individual comparison harder to declare significant. Your Bonferroni correction is particularly (and unnecessarily) strict in that way: for 6 comparisons, each one needs p < 0.05/6 ≈ 0.0083 to establish "significance" while maintaining a family-wise error rate of 0.05.
It's not clear why you needed to do all pairwise comparisons in the first place. If you do need them, a more powerful procedure such as Tukey's HSD, or at least the Holm modification of the Bonferroni correction, would be better (a sketch follows below). Even then, as @BruceET put it, "there is no guarantee that even real differences among population means detected by ANOVA will be resolved post hoc."
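Here is a minimal sketch of both options in Python; the cell labels, means, and the raw p-values passed to the Holm adjustment are all made up for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Simulated stand-in: outcome y and a combined group/sex cell label
rng = np.random.default_rng(1)
means = {"A-male": 0.0, "A-female": 0.3, "B-male": 0.5, "B-female": 1.2}
cells = np.repeat(list(means), 50)
y = np.array([means[c] for c in cells]) + rng.normal(size=len(cells))
df = pd.DataFrame({"y": y, "cell": cells})

# Tukey HSD over all 6 pairwise comparisons of the 4 cells,
# holding the family-wise error rate at 0.05
print(pairwise_tukeyhsd(df["y"], df["cell"], alpha=0.05))

# Holm is uniformly more powerful than plain Bonferroni:
# adjust the 6 raw p-values from whatever pairwise tests you ran
raw_p = np.array([0.004, 0.012, 0.030, 0.045, 0.20, 0.61])  # illustrative
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
print(reject, p_adj)
```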
This type of thing usually happens when the interaction coefficient $\beta_{GS}$ is fairly small in magnitude even if "significant" by the usual p < 0.05 criterion. Then you need to apply your knowledge of the subject matter to interpret the results fairly. You might be able to say that the associations of group and sex with outcome aren't strictly additive, but you might not be able to specify just which differences among group/sex combinations are "significantly" different.
Best Answer
Comparing means using side-by-side confidence intervals may seem straightforward, but it is in fact inappropriate, so please don’t do it.
A sample mean is one kind of statistic; the difference between two sample means is another. The two have different sampling distributions and different standard errors, sometimes dramatically different ones. For independent samples, the standard error of the difference is $\sqrt{SE_1^2 + SE_2^2}$, which is always smaller than $SE_1 + SE_2$, so two 95% CIs can overlap even when the difference between the means is clearly significant. There are many easy-to-construct examples where CIs overlap substantially but the differences are highly significant. It's harder to construct an example of the reverse case, but it's possible.
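A minimal numeric sketch of that point in Python, with made-up means and standard errors (all values hypothetical):

```python
import numpy as np
from scipy import stats

# Two hypothetical independent sample means with equal standard errors
m1, m2 = 0.0, 3.0
se1 = se2 = 1.0
z = stats.norm.ppf(0.975)  # ~1.96

# Individual 95% CIs: (-1.96, 1.96) and (1.04, 4.96), which overlap
ci1 = (m1 - z * se1, m1 + z * se1)
ci2 = (m2 - z * se2, m2 + z * se2)
print(ci1, ci2, "overlap:", ci1[1] > ci2[0])

# But the SE of the difference is sqrt(se1^2 + se2^2), not se1 + se2
se_diff = np.hypot(se1, se2)          # ~1.414
z_stat = (m2 - m1) / se_diff          # ~2.12
p = 2 * stats.norm.sf(abs(z_stat))    # ~0.034 < 0.05
print(z_stat, p)
```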
Repeat: One should not use CIs for means to test differences between them.