Solved – Pairwise comparisons after significant interaction results: parametric or non

anovanonparametricpost-hoc

I've run a 2-way ANOVA on growth rate data (grams/day) with the factors of year type (good and poor) and site (A and B). Though the data themselves are non-normal and do not have homogeneous variances, the residuals fall pretty nicely along the qq plot, and they are not heteroskedastic. Running the test shows that there is an interaction between year-type and site. It's been suggested to me that I now must run a series of pairwise comparisons to look for differences because of this interaction effect, which I assumed I'd need to do anyway.

My stats program (Sigmaplot11, which includes the SigmaStat package) automatically does post-hoc tests for any significant results (in this case is reports: "All Pairwise Multiple Comparison Procedures (Holm-Sidak method): Overall significance level = 0.05). I assume that this is the correct way to conduct the pairwise comparisons rather than making them separately?

Here's why I ask: Using these built-in post-hoc tests, I get significant results in both good years and poor (p<0.001 in each case). However, doing these tests separately (good vs. good and poor vs. poor) as Mann-Whitney rank sum tests (the raw data are non-normal with heterogeneous variances), I get no significant difference between sites within poor years. t-tests show the same as the ANOVA. I assume this is because I'm not comparing means in the case of the Mann-Whitney, but which post-hoc method should I be using in this case?!

EDIT: following @whuber's informative comment, I've taken a look at my other 2-way ANOVAs run in a similar manner. I've done another similar test today comparing adult weight. The t-test from the multiple pairwise comparisons after the 2-way ANOVA shows no difference (t=1.547, p=0.122), but a t-test run outside of the ANOVA shows highly significant difference (t=-4.739, p<0.001). As @chl pointed out, I would expect these 2 t-tests to have the same t-values; note that these 2 tests use exactly the same original data. Any idea why this might be or how I can interpret this?

Thanks for any suggestions you can provide!!

EDIT #2: Just to update anyone who's interested, I've taken a look more closely at the numbers behind the test, and it looks like somehow the software is not doing what I'm asking. It lists a table called the Least Square Means, listing each site, its mean and its standard error of the mean. The overall site means listed in this table are not correct. However, for each site within year type, it does have the correct means in the Least Square Means table. I'm giving up on the main factor of site comparison within this test, and sticking to the t-value given in the t-test for this main factor comparison. I'm still not sure what's going on exactly, but from comments and answers provided (thanks again!), I feel that this is a safe move, and I must move on with other things.
Thanks for all of your help!

Best Answer

If I understand your question correctly, you are wondering why you got different p-values from your t-tests when they are carried out as post-hoc tests or as separate tests. But did you control the FWER in the second case (because this is what id done with the step down Sidak-Holm method)? Because, in case of simple t-tests, the t-values won't change, unless you use a different pooling method for computing variance at the denominator, but the p-value of the unprotected tests will be lower than the corrected one.

This is easily seen with Bonferroni adjustment, since we multiply the observed p-value by the number of tests. With step-down methods like Holm-Sidak, the idea is rather to sort the null hypothesis tests by increasing p-values and correct the alpha value with Sidak correction factor in a stepwise manner ($\alpha’ = 1 - (1 - \alpha)^k$, with $k$ the number of possible comparisons, updated after each step). Note that, in contrast to Bonferroni-Holm's method, control of the FWER is only guaranteed when comparisons are independent. A more detailed description of the different kind of correction for multiple comparisons is available here: Pairwise Comparisons in SAS and SPSS.

Related Solutions

ANOVA Significance – Can It Be Significant When Pairwise T-Tests Are Not

Note: There was something wrong with my original example. I stupidly got caught by R's silent argument recycling. My new example is quite similar to my old one. Hopefully everything is right now.

Here's an example I made that has the ANOVA significant at the 5% level but none of the 6 pairwise comparisons are significant, even at the 5% level.

Here's the data:

g1:  10.71871  10.42931   9.46897   9.87644
g2:  10.64672   9.71863  10.04724  10.32505  10.22259  10.18082  10.76919  10.65447 
g3:  10.90556  10.94722  10.78947  10.96914  10.37724  10.81035  10.79333   9.94447 
g4:  10.81105  10.58746  10.96241  10.59571

enter image description here

Here's the ANOVA:

             Df Sum Sq Mean Sq F value Pr(>F)  
as.factor(g)  3  1.341  0.4469   3.191 0.0458 *
Residuals    20  2.800  0.1400

Here's the two sample t-test p-values (equal variance assumption):

        g2     g3     g4
 g1   0.4680 0.0543 0.0809 
 g2          0.0550 0.0543 
 g3                 0.8108

With a little more fiddling with group means or individual points, the difference in significance could be made more striking (in that I could make the first p-value smaller and the lowest of the set of six p-values for the t-test higher).

Edit: Here's an additional example that was originally generated with noise about a trend, which shows how much better you can do if you move points around a little:

g1:  7.27374 10.31746 10.54047  9.76779
g2: 10.33672 11.33857 10.53057 11.13335 10.42108  9.97780 10.45676 10.16201
g3: 10.13160 10.79660  9.64026 10.74844 10.51241 11.08612 10.58339 10.86740
g4: 10.88055 13.47504 11.87896 10.11403

The F has a p-value below 3% and none of the t's has a p-value below 8%. (For a 3 group example - but with a somewhat larger p-value on the F - omit the second group)

And here's a really simple, if more artificial, example with 3 groups:

g1: 1.0  2.1
g2: 2.15 2.3 3.0 3.7 3.85
g3: 3.9  5.0

(In this case, the largest variance is on the middle group - but because of the larger sample size there, the standard error of the group mean is still smaller)

Multiple comparisons t-tests

whuber suggested I consider the multiple comparisons case. It proves to be quite interesting.

The case for multiple comparisons (all conducted at the original significance level - i.e. without adjusting alpha for multiple comparisons) is somewhat more difficult to achieve, as playing around with larger and smaller variances or more and fewer d.f. in the different groups don't help in the same way as they do with ordinary two-sample t-tests.

However, we do still have the tools of manipulating the number of groups and the significance level; if we choose more groups and smaller significance levels, it again becomes relatively straightforward to identify cases. Here's one:

Take eight groups with $n_i=2$. Define the values in the first four groups to be (2,2.5) and in the last four groups to be (3.5,4), and take $\alpha=0.0025$ (say). Then we have a significant F:

> summary(aov(values~ind,gs2))
            Df Sum Sq Mean Sq F value  Pr(>F)   
ind          7      9   1.286   10.29 0.00191 
Residuals    8      1   0.125

Yet the smallest p-value on the pairwise comparisons is not significant that that level:

> with(gs2,pairwise.t.test(values,ind,p.adjust.method="none"))

        Pairwise comparisons using t tests with pooled SD 

data:  values and ind 

   g1     g2     g3     g4     g5     g6     g7    
g2 1.0000 -      -      -      -      -      -     
g3 1.0000 1.0000 -      -      -      -      -     
g4 1.0000 1.0000 1.0000 -      -      -      -     
g5 0.0028 0.0028 0.0028 0.0028 -      -      -     
g6 0.0028 0.0028 0.0028 0.0028 1.0000 -      -     
g7 0.0028 0.0028 0.0028 0.0028 1.0000 1.0000 -     
g8 0.0028 0.0028 0.0028 0.0028 1.0000 1.0000 1.0000

P value adjustment method: none

Solved – Doing post-hoc after a not significant interaction in mixed ANOVA

As a reviewer there would be several things here that would concern me.

Assuming we were looking at the set of possible two-way interactions in your post-hocs (the next rational step in a decomposition from a three-way interaction), then a significant effect for one two-way interaction (but not for the others) would not necessitate a three way interaction per se. For example, one two-way interaction may have a statistically significant effect size greater than 0 and the others may have effects in the same direction, but not large enough to be greater than 0. Nevertheless, because all are going in the same direction, then there might not be sufficient evidence to suggest that they are sufficiently different from each other to reject the null hypothesis that they are the same (i.e., not a statistically significant three-way interaction).

That being said, I don't see your post-hocs here as testing the differences between two-way interactions (i.e. differences in the differences). You seem to be testing a subset of possible main effects (differences manipulating only a single variable while holding the levels of other variables fixed). For example, none of your comparisons involve both the Experimental and Control groups.

What does your result actually indicate? I think it indicates a statistically significant difference between those two particular conditions (Control, 1, 1 and Control, 2, 1).

Regardless, you should know that your lack of a three-way interaction here is probably not a power issue. If it were simply a power issue, then the F ratio for your three way interaction would exceed 1. As it is, there is less variance in the three-way interaction that would be expected on average if the null hypothesis were true.

Finally, assuming the comparisons you did perform were of interest, then I would expect to see the comparisons done as a priori... a planned post-hoc makes no sense to me. That being said, I also know some reviewers are very post-hoc correction happy. The most important part here is that I would want to see those results interpreted appropriately (and not alluded to as a three way interaction).

Edit: Oh, and I should acknowledge that I've seen plenty of people interpret significant results consistent with a desired interaction as being evidence in strong favor of the interaction. I've even seen this in top tier journals. That being said, I strongly recommend against it (then again, I have a particular problem with this misbehavior, c.f. https://stats.stackexchange.com/a/4572/196).

Best Answer

Related Solutions

ANOVA Significance – Can It Be Significant When Pairwise T-Tests Are Not

Solved – Doing post-hoc after a not significant interaction in mixed ANOVA

Related Question