Here are some suggestions for how you could respond:
- Given that your predictors are continuous (and artificially categorizing continuous predictors is usually frowned upon), meta-regression seems like the right approach. In fact, meta-regression can handle categorical predictors just as well as stratifying on them can, so why bother stratifying at all?
- If I understand you correctly, you entered those 4 predictors simultaneously into the model. Stratifying would imply either examining one (artificially categorized) predictor at a time (which ignores heterogeneity that may be better accounted for by other predictors, as well as potential confounding between the predictors) or stratifying on combinations of predictors, in which case the subgroups quickly become quite small. Neither seems like a good idea (see also the next point).
- How well the amount of heterogeneity in a random-effects model (or the amount of residual heterogeneity in a mixed-effects meta-regression model) is estimated depends to a great extent on the number of studies. Stratifying will lead to smaller subsets with poorer estimates of the amount of heterogeneity.
I actually discuss these issues in this article:
Viechtbauer, W. (2007). Accounting for heterogeneity via random-effects models and moderator analyses in meta-analysis. Zeitschrift für Psychologie / Journal of Psychology, 215(2), 104-121.
If you are interested and cannot get hold of a copy of the article (it appeared in a German journal, but the article itself is in English), feel free to send me an e-mail (my website is linked from my profile; you can find my e-mail address there).
As you note, the model that adds random effects for each study and random effects for each outcome is a model that accounts for hierarchical dependence. This model allows the true outcomes/effects within a study to be correlated. This is the Konstantopoulos (2011) example you link to.
But this model still assumes that the sampling errors of the observed outcomes/effects within a study are independent, which is definitely not the case when those outcomes are assessed within the same individuals. So, as in the Berkey et al. (1998) example you link to, ideally you need to construct the whole variance-covariance matrix of the sampling errors (with the sampling variances along the diagonal). The chapter by Gleser and Olkin (2009) from the Handbook of research synthesis and meta-analysis describes how the covariances can be computed for various outcome measures (including standardized mean differences). The analyses/methods from that chapter are replicated here (you are dealing with the multiple-endpoint case).
And as you note, doing this requires knowing how the actual measurements within studies are correlated. Using your example, you would need to know for study 1 how strong the correlation was between the two measurements for "Phonological loop" (more accurately, there are two correlations, one for the first and one for the second group, but we typically assume that the correlation is the same for the two groups), and how strongly those measurements were correlated with the "Central Executive" measurements. So, three correlations in total.
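To make the construction concrete, here is a minimal sketch (in Python; the function name is my own, this is not metafor code) of how such a variance-covariance matrix can be built for standardized mean differences within one study, using the expressions from Gleser and Olkin (2009). The `rho` argument is the correlation matrix among the measurements (the correlations discussed above, assumed equal in both groups):

```python
import numpy as np

def smd_vcov(d, n1, n2, rho):
    """Variance-covariance matrix of the sampling errors for p
    standardized mean differences measured on the same subjects
    (the multiple-endpoint case).

    d   : SMDs for the p outcomes within one study
    n1  : group-1 sample size; n2 : group-2 sample size
    rho : p x p correlation matrix among the measurements
          (assumed equal in the two groups)
    """
    d = np.asarray(d, dtype=float)
    rho = np.asarray(rho, dtype=float)
    p = len(d)
    V = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            if i == j:
                # sampling variance of a single SMD
                V[i, j] = 1/n1 + 1/n2 + d[i]**2 / (2 * (n1 + n2))
            else:
                # covariance between two SMDs (Gleser & Olkin, 2009)
                V[i, j] = rho[i, j] * (1/n1 + 1/n2) + \
                          rho[i, j]**2 * d[i] * d[j] / (2 * (n1 + n2))
    return V
```

One such block per study, placed along the (block-)diagonal of the full matrix, gives the V matrix needed for the multivariate model.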
Obtaining/extracting these correlations is often difficult, if not impossible (as they are often not reported). If you really cannot obtain them (even after contacting study authors in an attempt to obtain the missing information), there are several options:
One can still often make a rough, educated guess as to how large the correlations are. We then use those 'guesstimates' and conduct sensitivity analyses to ensure that the conclusions remain unchanged when the values are varied within a reasonable range.
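As an illustration of such a sensitivity analysis (a minimal Python sketch with made-up numbers, not code from any package): for two dependent estimates from one study, we vary the assumed sampling-error correlation and recompute the pooled generalized-least-squares estimate each time.

```python
import numpy as np

def gls_mean(y, V):
    """Generalized-least-squares pooled estimate and its standard error."""
    y = np.asarray(y, dtype=float)
    Vinv = np.linalg.inv(V)
    one = np.ones(len(y))
    var = 1.0 / (one @ Vinv @ one)
    est = var * (one @ Vinv @ y)
    return est, np.sqrt(var)

# two correlated estimates (hypothetical values); rho is the unknown
# sampling-error correlation that we vary in the sensitivity analysis
y = np.array([0.40, 0.25])
se = np.array([0.10, 0.12])

for rho in (0.0, 0.3, 0.5, 0.7):
    V = np.diag(se**2)
    V[0, 1] = V[1, 0] = rho * se[0] * se[1]
    est, s = gls_mean(y, V)
    print(f"rho = {rho:.1f}: estimate = {est:.3f}, SE = {s:.3f}")
```

If the estimate and its standard error barely move across the plausible range of rho, the choice of guesstimate is inconsequential for the conclusions.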
One could use robust methods -- in essence, we then consider the assumed variance-covariance matrix of the sampling errors to be misspecified (i.e., we assume it is diagonal, when in fact we know it isn't) and estimate the variance-covariance matrix of the fixed effects (which are typically of primary interest) with an estimator that remains consistent even under such model misspecification. This is in essence the approach described by Hedges, Tipton, and Johnson (2010) that you mentioned.
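To give a rough idea of the sandwich logic (a minimal CR0-type sketch in Python with a function name of my own; the actual implementations in robumeta and clubSandwich add the small-sample corrections):

```python
import numpy as np

def robust_se_mean(y, v, cluster):
    """Cluster-robust (sandwich, CR0-type) standard error for the
    inverse-variance weighted mean: the working weights treat the
    sampling errors as independent, but the 'meat' term captures
    the within-cluster dependence empirically."""
    y, v = np.asarray(y, dtype=float), np.asarray(v, dtype=float)
    w = 1.0 / v
    est = np.sum(w * y) / np.sum(w)
    e = y - est
    # sum the weighted residuals within each cluster (study) before
    # squaring, so within-cluster covariance enters the estimate
    meat = 0.0
    for c in set(cluster):
        idx = [i for i, ci in enumerate(cluster) if ci == c]
        meat += np.sum(w[idx] * e[idx]) ** 2
    bread = 1.0 / np.sum(w)
    return est, np.sqrt(bread * meat * bread)
```

The point estimate is unchanged; only its standard error (and hence the tests and confidence intervals) is adjusted for the dependence.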
Resampling methods (i.e., bootstrapping and permutation testing) may also work.
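For the bootstrap, the key is to resample whole studies (clusters), not individual estimates, so that the within-study dependence is preserved in every replicate. A minimal sketch (Python, with a hypothetical function name and a simple inverse-variance weighted mean as the statistic):

```python
import numpy as np

rng = np.random.default_rng(42)

def cluster_bootstrap_ci(y, v, cluster, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the inverse-variance weighted mean,
    resampling entire studies (clusters) with replacement so that the
    dependence among estimates within a study is carried along."""
    y, v = np.asarray(y, dtype=float), np.asarray(v, dtype=float)
    cluster = np.asarray(cluster)
    ids = np.unique(cluster)
    groups = {c: np.flatnonzero(cluster == c) for c in ids}
    stats = []
    for _ in range(n_boot):
        pick = rng.choice(ids, size=len(ids), replace=True)
        idx = np.concatenate([groups[c] for c in pick])
        w = 1.0 / v[idx]
        stats.append(np.sum(w * y[idx]) / np.sum(w))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```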
There are also some alternative models that try to circumvent the problem by means of some simplification of the model. Specifically, in the model/approach by Riley and colleagues (see, for example: Riley, Abrams, Lambert, Sutton, & Thompson, 2007, Statistics in Medicine, 26, 78-97), we assume that the correlation among the sampling errors is identical to the correlation among the underlying true effects, and then we just estimate that one correlation. This can work, but whether it does depends on how well that simplification matches up with reality.
There is always another option: Avoid any kind of statistical dependence via data reduction (e.g., selecting only one estimate, conducting separate analyses for different outcomes). This is still the most commonly used approach for 'handling' the problem, because it allows practitioners to stick to (relatively simple) models/methods/software they are already familiar with. But this approach can be wasteful and limits inference (e.g., if we conduct two separate meta-analyses for outcomes A and B, we cannot test whether the estimated effect is different for A and B unless we can again properly account for their covariance).
Note: The same issue was discussed on the R-sig-mixed-models mailing list and in essence I am repeating what I already posted there. See here.
For the robust method, you could try the robumeta package. If you want to stick to metafor, you will find these blog posts by James Pustejovsky of interest. He is also working on another package, called clubSandwich, which adds some additional small-sample corrections. You can also try the development version of metafor (see here) -- it includes a new function called robust() that you can use after fitting your model to obtain cluster-robust tests and confidence intervals. And you can find some code to get you started with bootstrapping here.
If I understand your data, you have two betas you want to average via meta-analysis. Each has a different standard error (and is based on a different sample size). To compute the average of these betas, you are using inverse-variance weighted meta-analysis. The significance of the mean beta is assessed with a z-test. In a fixed-effect model, its standard error is the square root of the inverse of the sum of the weights (where the weights are the inverses of the squared standard errors).
What moves forward to the meta-analysis from the simple OLS regression models is the betas and their standard errors, not the t-tests and p-values (the latter two carry no unique information beyond the estimate and its standard error).
The reason that we do not use a t-test to assess the significance of the mean beta is that this beta is a meta-analytic mean, and information about its precision comes from the precision estimates of the two betas on which it is based. An important caveat: this assumes a fixed-effect model. With only two estimates, you cannot get any reasonable estimate of the between-study variance for a random-effects model.
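The computation described above can be sketched in a few lines (Python; the function name is mine, not from any package):

```python
import math

def fixed_effect_meta(betas, ses):
    """Fixed-effect inverse-variance weighted average of the betas,
    with its standard error and z-statistic."""
    # weights: inverses of the squared standard errors
    w = [1.0 / s**2 for s in ses]
    est = sum(wi * b for wi, b in zip(w, betas)) / sum(w)
    # SE of the pooled estimate: sqrt of the inverse of the summed weights
    se = math.sqrt(1.0 / sum(w))
    z = est / se
    return est, se, z
```

Note how the more precise beta (smaller standard error) pulls the average toward itself, and how the pooled standard error is smaller than either input standard error.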