Solved – Q-Test of Heterogeneity for only two effect sizes

cochran-q, effect-size, heterogeneity, meta-analysis

I am using the Q-Test of Heterogeneity to investigate if several effect sizes derived from two different studies are significantly different from each other.

More precisely, I have, for every effect of interest, a Cohen's d from one study and one from the other, together with their respective sampling variances. I calculate the Q-value (using the formula from Cochran, 1954), and if it exceeds the critical value of the chi-squared distribution with 1 df (3.84), I conclude that the two effect sizes are significantly different.
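For concreteness, here is a minimal sketch in R of what I am computing (the two d values and variances below are just placeholders):

# Cochran's Q for two effect sizes (placeholder values)
y <- c(0.30, 0.80)          # Cohen's d from study 1 and study 2
v <- c(0.05, 0.04)          # their sampling variances
w <- 1 / v                  # inverse-variance weights
ybar <- sum(w * y) / sum(w) # weighted mean effect
Q <- sum(w * (y - ybar)^2)  # Cochran's Q statistic
Q > qchisq(0.95, df = 1)    # TRUE if Q exceeds the critical value 3.84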

Now my question is: are there any statistical or theoretical objections to this method?

I understand that the Q-test is normally used in meta-analyses to make sure that effect sizes are homogeneous, not heterogeneous. But can it also be "abused" in the way I am proposing?
Also, I read in several sources that the Q-test lacks power when applied to small samples. However, I used it with the smallest possible sample size (2 studies) and still found a number of highly significant and theoretically interesting differences. Is there any statistical problem with my approach that I am missing?

As I would like to apply this procedure in my master's thesis, I would also be incredibly thankful if somebody could give me a reference on this topic, for example a paper with a critical discussion of this procedure, or a study where it was successfully applied. (Unfortunately, the statistical training at my university hasn't even scratched the surface of the wide field of meta-analysis…)

Thanks a lot in advance!

Best Answer

First of all, some terminology. In my opinion, it is arbitrary whether we call it the "Q-test of homogeneity" or the "Q-test for heterogeneity". Under the null hypothesis, we assume homogeneity, so calling it the Q-test of homogeneity would emphasize that we are testing this assumption. But the alternative hypothesis states that the true effects/outcomes are heterogeneous, so we could also say that we are using it to test whether heterogeneity is present. Some may disagree -- but in the end, there is just the Q-test, whatever we call it.

As for using the Q-test when there are only two studies: That is equivalent to testing the null hypothesis that the true effect for the first study is the same as the true effect for the second study. That should be clear if we write down the null hypothesis for the Q-test, namely $$H_0: \theta_i = \theta \mbox{ for all } i = 1, \ldots, k,$$ which for $k = 2$ is identical to $$H_0: \theta_1 = \theta_2,$$ where $\theta_i$ denotes the true effect/outcome for the $i$th study.
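In fact, for $k = 2$ the Q statistic has a simple closed form. With inverse-variance weights $w_i = 1/v_i$ and weighted mean $\bar{y}_w = \sum w_i y_i / \sum w_i$, the general statistic $$Q = \sum_{i=1}^k w_i (y_i - \bar{y}_w)^2$$ reduces for two studies to $$Q = \frac{(y_1 - y_2)^2}{v_1 + v_2},$$ which is simply the square of a Wald-type $z$-statistic for the difference between the two estimates.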

We can also demonstrate this with an example. I'll use R with the metafor package for this.

library(metafor)

yi <- c(.14, .75)   # observed effect size estimates
vi <- c(.083, .042) # corresponding sampling variances
di <- c(0, 1)       # dummy variable distinguishing the two studies

### fixed-effects model
rma(yi, vi, method="FE")

The results are:

Fixed-Effects Model (k = 2)

Test for Heterogeneity: 
Q(df = 1) = 2.9768, p-val = 0.0845

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub          
  0.5450   0.1670   3.2638   0.0011   0.2177   0.8723       ** 

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So, we find $Q(1) = 2.9768$ with $p = 0.0845$. Now let's fit a (fixed-effects) meta-regression to these data, adding the dummy variable to the model that distinguishes the first from the second study:

### meta-regression model
rma(yi, vi, mods = ~ di, method="FE")

This yields:

Fixed-Effects with Moderators Model (k = 2)

Test for Residual Heterogeneity: 
QE(df = 0) = 0.0000, p-val = 1.0000

Test of Moderators (coefficient(s) 2): 
QM(df = 1) = 2.9768, p-val = 0.0845

Model Results:

         estimate      se    zval    pval    ci.lb   ci.ub   
intrcpt    0.1400  0.2881  0.4859  0.6270  -0.4247  0.7047   
di         0.6100  0.3536  1.7253  0.0845  -0.0830  1.3030  .

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that the p-value for the dummy variable is exactly the same as that of the Q-test, and squaring its z-value recovers the chi-square value of the Q-test. This is also the value of the $Q_M$ test, the omnibus test of all coefficients (except for the intercept); since the model contains only one coefficient (the one for the dummy variable), the omnibus test is identical to testing whether that coefficient differs significantly from zero.
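We can confirm this directly from the fitted model objects (QE and zval are components of the objects returned by rma()):

res.fe <- rma(yi, vi, method="FE")
res.mr <- rma(yi, vi, mods = ~ di, method="FE")
res.fe$QE          # Q statistic from the fixed-effects model: 2.9768
res.mr$zval[2]^2   # squared z-value of the dummy variable: 2.9768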

So, "mechanically speaking", that all works out as expected. Is this a valid procedure? I don't see anything wrong with this -- you are just testing whether the effects of study 1 and study 2 are significantly different from each other or not.

Power is of course always an issue to keep in mind, and yes, power may be low. So, quite importantly, if you do not find a significant difference, you should be very cautious about how you interpret that: it simply means that you do not have sufficient evidence to reject the null hypothesis; the true effects could very well still differ from each other.
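To get a feel for this, note that under the alternative (treating the sampling variances as known), Q with $k = 2$ follows a noncentral chi-square distribution with 1 df and noncentrality parameter $(\theta_1 - \theta_2)^2 / (v_1 + v_2)$. A quick sketch in R, using an assumed true difference of 0.5 and the variances from the example above:

# power of the Q-test for an assumed true difference of 0.5
delta <- 0.5
ncp <- delta^2 / (vi[1] + vi[2])     # noncentrality parameter
crit <- qchisq(0.95, df = 1)         # critical value 3.84
pchisq(crit, df = 1, ncp = ncp, lower.tail = FALSE)  # power is only about 0.29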

On the other hand, if you use this approach to carry out lots of tests, it is in fact likely that some significant findings are just Type I errors. So, you also need to be cautious in interpreting any significant findings, especially if you do not correct for multiple testing.
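If you do want to adjust for multiplicity, the p-values from a series of such Q-tests can be corrected with p.adjust() in base R (the p-values below are made up purely for illustration):

pvals <- c(0.003, 0.021, 0.048, 0.350)   # hypothetical p-values from several Q-tests
p.adjust(pvals, method = "holm")         # Holm correction controls the familywise error rate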
