Solved – Multiple comparisons in mixed effects model

false-discovery-rate, familywise-error, lme4-nlme, mixed-model, r

tl;dr In a random-slopes model, how should one adjust for multiple comparisons when performing inference on the group-specific slopes (the BLUPs)?

Note 1: Bretz et al, the R package 'multcomp', and several other questions on this site deal with multiple comparisons in the context of the fixed effects in mixed-effect models. This question is about the random effects.

Note 2: this question is easiest to ask using the well-developed frequentist vocabulary surrounding false-discovery rates and multiple comparisons. However, I am equally interested in answers that offer a Bayesian perspective on this problem: how should one temper the interpretation of credible intervals in light of the multiple comparisons issue?

Note 3: I've edited the question substantially thanks to helpful comments from Alexis.

THE QUESTION:

Suppose I fit a well-specified random slopes model to data with N groups. I wish to perform inference on whether these slopes (the BLUPs) differ from zero while controlling the false discovery rate. Is it possible to construct a more powerful test than we would achieve using standard p-value adjustments?

Several lines of reasoning suggest that it should be possible, at least in some cases. Below, I present some simulation results suggesting that this is the case, but interpreting the simulations is complicated, for reasons that I'll discuss. First, some conceptual ramblings:

  1. It's useful to distinguish two cases: one where the true value of every group-level slope is zero, and another where it is not. In the first case (every slope is zero), lme4 (and possibly other mixed modeling software?) will likely conclude that the random-effect variance is zero, the model collapses to a global-slope model, and the multiple comparisons problem disappears. Thus, counterintuitively, we face the multiple comparisons problem primarily when there is good evidence in the data for variation in the BLUPs, which in turn tends to happen only when some of the true effect sizes are nonzero.

  2. If the random-effects mean is statistically indistinguishable from zero, perhaps one could perform a likelihood ratio test to determine whether the random effect belongs in the model at all. If we can reject the null hypothesis that all groups are equal (i.e. that the random effect is superfluous), then it must be the case that at least some of the group-level slopes are nonzero, and we might then follow up with some sort of post-hoc test (a rough sketch of such a test appears after this list).

  3. The random-slopes model provides a shrinkage estimator for the slopes. If the true random-effect mean (i.e. the fixed effect in lme4 parlance) is zero, this shrinkage should tend to make it harder to reject a true null. Here, Gelman explores this topic in a Bayesian context. If, on the other hand, the true random-effect mean is nonzero, shrinkage should make false rejections of true nulls relatively frequent, because groups with no true effect will tend to get pulled towards the nonzero overall mean.

  4. If we want to study the issue using simulation, there's a bit of a problem. We need to inject a large number of true nulls into a single model in order to study the behavior. At the same time, we need to inject some groups with true effects to force the model to estimate a nonzero random effects variance. When we do both of these things, the true distribution of the effect sizes is no longer normal, which is a form of lack-of-fit in the model.
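
As a rough sketch of the likelihood-ratio comparison in point 2 (assuming, for simplicity, a random-intercepts structure and a data frame like the testdata simulated further below, with columns y and group): because the random-effect variance sits on the boundary of its parameter space under the null, the naive chi-square(1) reference distribution is conservative, and a common approximation uses a 50:50 mixture of chi-square(0) and chi-square(1), i.e. it halves the p-value. (The RLRsim package's exactRLRT offers a simulation-based alternative.)

library(lme4)

m1 <- lmer(y ~ (1|group), data = testdata, REML = FALSE)   # model with the random effect
m0 <- lm(y ~ 1, data = testdata)                           # null model with no group effect

LRT <- as.numeric(2 * (logLik(m1) - logLik(m0)))
pval <- 0.5 * pchisq(LRT, df = 1, lower.tail = FALSE)      # boundary-corrected p-value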

The above caveats (especially point #4) notwithstanding, I've done a small simulation study with the R code below. Feel free to play around with varying Ng and SSpg in the code, but the really important parameter is mTE, the mean effect size among the groups that do not correspond to true nulls.

With mTE <- 0, we find that the true nulls indeed reach significance at a nominal alpha of 0.05 less than 5% of the time. However, with mTE <- 5, and using the default values for Ng and SSpg in the code, these same groups produce Type 1 errors 100% of the time.

Code below deals with a random intercepts model for computational efficiency. The core conceptual issues are the same as in a random slopes model, I think.

library(lme4)

set.seed(1)

type1 <- vector()
type2 <- vector()

Ng <- 400    # Must be an even number. We will split this number of groups in half: one half will have
             # true effect sizes of zero (these are the true nulls), and the other half will have
             # normally distributed effect sizes (necessary to include in order to ensure that we
             # estimate a non-zero random effect variance).
SSpg <- 20   # Sample size per group
mTE <- 0     # The mean of the true effect size for the groups that do not correspond to true nulls.

for(i in 1:100){
print(i)
groupmeans <- c(rnorm(Ng/2) + mTE, rep(0, Ng/2))
# There are Ng groups in total (400 by default, to give a reasonable sample size for studying the
# false-discovery behavior). The second Ng/2 groups all have effect sizes of zero; these are the true
# nulls that might be subject to Type 1 error. The first Ng/2 groups obey the random-effects
# specification; including some groups like this is necessary to prevent the model from estimating
# zero random effect variance.

testdata <- as.data.frame(matrix(data = NA, nrow=Ng*SSpg, ncol=2))
colnames(testdata) <- c('group', 'y')

testdata$group <- rep(c(1:Ng), SSpg)
testdata$y <- rnorm(Ng*SSpg, groupmeans[testdata$group])

my.model <- lmer(y ~ (1|group), data = testdata)

# Normal-approximation inference on each group's effect:
# does the approximate 95% interval exclude zero?

cV <- ranef(my.model, condVar = TRUE, drop = TRUE)
ranvar <- attr(cV[[1]], "postVar")        # conditional variance of each group-level effect
allvar <- ranvar + diag(vcov(my.model))   # add the sampling variance of the fixed intercept
BLUPsd <- sqrt(allvar[1])                 # the design is balanced, so every group shares this SD

# check which group-level effects are significantly different from zero
sig <- which(cV$group + fixef(my.model) - 1.96*BLUPsd > 0 |
             cV$group + fixef(my.model) + 1.96*BLUPsd < 0)
type1[i] <- length(which(sig > (Ng/2)))   # count rejections among the true nulls (groups Ng/2 + 1 to Ng)
}

summary(type1)
hist(type1)
Ng/40
# Expected number of false discoveries per iteration if the p-values of the Ng/2 true nulls are
# uniformly distributed and independent: (Ng/2) * 0.05 = Ng/40.

As Amoeba points out, the very idea of getting p-values for random-effect BLUPs is not standard, as has been discussed here. The objections raised in that link do not resonate with me; from my (generally Bayesian) perspective, I see nothing wrong with the idea of a credible interval around an individual group mean in a hierarchical model.

Bolker provides a sketch of how to get confidence intervals for the BLUPs from lme4, and the above computation of p-values is based on treating these "confidence intervals" uncritically as standard-fare confidence intervals. Briefly, we take the variance of the conditional mean for the BLUP (note that because this model is normal, the conditional mean and the conditional mode are one and the same), add to it the sampling variance of the fixed-intercept estimate, and compute the z-score and p-value in just the way one might expect. I don't actually know whether the use of a normal distribution here is exact (because the whole model is Gaussian) or approximate. Note that some of the subtleties and pitfalls inherent in doing this involve covariances between random effects that are not present in this model. However, the approach used here does NOT account for the non-independence between the fixed-intercept estimate and the random-intercept estimates. I'm pretty sure that, if anything, this would make my p-values anti-conservative, so the fact that the Type 1 error rate is less than alpha (when the true mean is close to zero) is still noteworthy.
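
Concretely, under these uncritical assumptions, per-group two-sided p-values could be sketched as follows, reusing my.model, cV, and BLUPsd from the simulation code above (this is just the normal-approximation calculation described in the preceding paragraph, not an established lme4 interface):

est <- cV$group + fixef(my.model)   # conditional mean of each group effect plus the fixed intercept
z <- est / BLUPsd                   # approximate z-scores
p <- 2 * pnorm(-abs(z))             # two-sided p-values for H0: the group's effect is zero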

Best Answer

If one of your model p-values has the meaning "probability of observing a parameter estimate that is as or more extreme than the one estimated, assuming the null hypothesis is true" (i.e. the standard interpretation of p-values), then you need to adjust for multiple comparisons.

This is because the meaning of "the probability of rejecting the null hypothesis when the null hypothesis is true" (given some preferred Type I error rate $\alpha$) is no longer coherent when there is no longer "the null hypothesis," but rather more than one null hypothesis.

That said: go with false discovery rate methods, almost certainly the Benjamini-Hochberg procedure rather than Benjamini-Yekutieli (a violation of positive dependence is difficult to conceive in a regression setting), because they (1) are not sensitive to, and do not require, a conceptually undefined "family", and (2) adapt to new information about the probability that a null hypothesis is true (i.e. if you have already rejected some number of tests, you should no longer believe that 100% of your remaining null hypotheses are true).
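
For what it's worth, given a vector p of per-group p-values (for instance the ones sketched in the question above), the Benjamini-Hochberg adjustment is available directly in base R; groups whose adjusted p-value falls below the chosen FDR level are the discoveries:

p.adj <- p.adjust(p, method = "BH")   # Benjamini-Hochberg adjustment
discoveries <- which(p.adj < 0.05)    # groups declared nonzero at a 5% false discovery rate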