Solved – Multiple comparisons in mixed effects model

false-discovery-rate, familywise-error, lme4-nlme, mixed-model, r

tl;dr In a random-slopes model, how should one adjust for multiple comparisons when performing inference on the group-specific slopes (the BLUPs)?

Note 1: Bretz et al, the R package 'multcomp', and several other questions on this site deal with multiple comparisons in the context of the fixed effects in mixed-effect models. This question is about the random effects.

Note 2: this question is easiest to ask using the well-developed frequentist vocabulary surrounding false-discovery rates and multiple comparisons. However, I am equally interested in answers that offer a Bayesian perspective on this problem: how should one temper the interpretation of credible intervals in light of the multiple comparisons issue?

Note 3: I've edited the question substantially thanks to helpful comments from Alexis.

THE QUESTION:

Suppose I fit a well-specified random slopes model to data with N groups. I wish to perform inference on whether these slopes (the BLUPs) differ from zero while controlling the false discovery rate. Is it possible to construct a more powerful test than we would achieve using standard p-value adjustments?

Several lines of reasoning suggest that it should be possible, at least in some cases. Below, I present some simulation results suggesting that this is the case, but interpreting the simulations is complicated, for reasons that I'll discuss. First, some conceptual ramblings:

  1. It's useful to distinguish two cases: one where the true value of every group-level slope is zero, and another where it is not. In the first case (every slope is zero), lme4 (and possibly other mixed modeling software?) will likely conclude that the random-effect variance is zero, the model collapses to a global-slope model, and the multiple comparisons problem disappears. Thus, counterintuitively, we face the multiple comparisons problem primarily when there is good evidence in the data for variation in the BLUPs, which in turn tends to happen only when some of the true effect sizes are nonzero.

  2. If the random-effects mean is statistically indistinguishable from zero, perhaps one could perform a likelihood ratio test to determine whether the random effect belongs in the model at all. If we can reject the null hypothesis that all groups are equal (i.e. that the random effect is superfluous), then it must be the case that at least some of the group-level slopes are nonzero, and we might then follow up with some sort of post-hoc test (a rough sketch of such a test appears after this list).

  3. The random-slopes model provides a shrinkage estimator for the slopes. If the true random-effect mean (i.e. the fixed effect in lme4 parlance) is zero, this shrinkage should tend to make it harder to reject a true null. Here, Gelman explores this topic in a Bayesian context. If, on the other hand, the true random-effect mean is nonzero, shrinkage should make false rejections of true nulls relatively frequent, because groups with no true effect will tend to get pulled towards the nonzero overall mean.

  4. If we want to study the issue using simulation, there's a bit of a problem. We need to inject a large number of true nulls into a single model in order to study the behavior. At the same time, we need to inject some groups with true effects to force the model to estimate a nonzero random effects variance. When we do both of these things, the true distribution of the effect sizes is no longer normal, which is a form of lack-of-fit in the model.
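
As a rough sketch of the likelihood-ratio comparison in point 2 (assuming, for simplicity, a random-intercepts structure and a data frame like the testdata simulated further below, with columns y and group): because the random-effect variance sits on the boundary of its parameter space under the null, the naive chi-square(1) reference distribution is conservative, and a common approximation uses a 50:50 mixture of chi-square(0) and chi-square(1), i.e. it halves the p-value. (The RLRsim package's exactRLRT offers a simulation-based alternative.)

library(lme4)

m1 <- lmer(y ~ (1|group), data = testdata, REML = FALSE)   # model with the random effect
m0 <- lm(y ~ 1, data = testdata)                           # null model with no group effect

LRT <- as.numeric(2 * (logLik(m1) - logLik(m0)))
pval <- 0.5 * pchisq(LRT, df = 1, lower.tail = FALSE)      # boundary-corrected p-value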

The above caveats (especially point #4) notwithstanding, I've done a small simulation study with the R code below. Feel free to play around with varying Ng and SSpg in the code, but the really important parameter is mTE, the mean effect size among the groups that do not correspond to true nulls.

With mTE <- 0, we find that the true nulls indeed reach significance at a nominal alpha of 0.05 less than 5% of the time. However, with mTE <- 5, and using the default values for Ng and SSpg in the code, these same groups produce Type 1 errors 100% of the time.

Code below deals with a random intercepts model for computational efficiency. The core conceptual issues are the same as in a random slopes model, I think.

library(lme4)

set.seed(1)

type1 <- vector()
type2 <- vector()

Ng <- 400    # Must be an even number. We will split this number of groups in half: one half will have
             # true effect sizes of zero (these are the true nulls), and the other half will have
             # normally distributed effect sizes (necessary to include in order to ensure that we
             # estimate a non-zero random effect variance).
SSpg <- 20   # Sample size per group
mTE <- 0     # The mean of the true effect size for the groups that do not correspond to true nulls.

for(i in 1:100){
print(i)
groupmeans <- c(rnorm(Ng/2) + mTE, rep(0, Ng/2))
# There are Ng groups in total (400 by default, to give a reasonable sample size for studying the
# false-discovery behavior). The second Ng/2 groups all have effect sizes of zero; these are the true
# nulls that might be subject to Type 1 error. The first Ng/2 groups obey the random-effects
# specification; including some groups like this is necessary to prevent the model from estimating
# zero random effect variance.

testdata <- as.data.frame(matrix(data = NA, nrow=Ng*SSpg, ncol=2))
colnames(testdata) <- c('group', 'y')

testdata$group <- rep(c(1:Ng), SSpg)
testdata$y <- rnorm(Ng*SSpg, groupmeans[testdata$group])

my.model <- lmer(y ~ (1|group), data = testdata)

# Normal-approximation inference on each group's effect:
# does the approximate 95% interval exclude zero?

cV <- ranef(my.model, condVar = TRUE, drop = TRUE)
ranvar <- attr(cV[[1]], "postVar")        # conditional variance of each group-level effect
allvar <- ranvar + diag(vcov(my.model))   # add the sampling variance of the fixed intercept
BLUPsd <- sqrt(allvar[1])                 # the design is balanced, so every group shares this SD

# check which group-level effects are significantly different from zero
sig <- which(cV$group + fixef(my.model) - 1.96*BLUPsd > 0 |
             cV$group + fixef(my.model) + 1.96*BLUPsd < 0)
type1[i] <- length(which(sig > (Ng/2)))   # count rejections among the true nulls (groups Ng/2 + 1 to Ng)
}

summary(type1)
hist(type1)
Ng/40
# Expected number of false discoveries per iteration if the p-values of the Ng/2 true nulls are
# uniformly distributed and independent: (Ng/2) * 0.05 = Ng/40.

As Amoeba points out, the very idea of getting p-values for random-effect BLUPs is not standard, as has been discussed here. The objections raised in that link do not resonate with me; from my (generally Bayesian) perspective, I see nothing wrong with the idea of a credible interval around an individual group mean in a hierarchical model.

Bolker provides a sketch of how to get confidence intervals for the BLUPs from lme4, and the above computation of p-values is based on treating these "confidence intervals" uncritically as standard-fare confidence intervals. Briefly, we take the variance of the conditional mean for the BLUP (note that because this model is normal, the conditional mean and the conditional mode are one and the same), add to it the sampling variance of the fixed-intercept estimate, and compute the z-score and p-value in just the way one might expect. I don't actually know whether the use of a normal distribution here is exact (because the whole model is Gaussian) or approximate. Note that some of the subtleties and pitfalls inherent in doing this involve covariances between random effects that are not present in this model. However, the approach used here does NOT account for the non-independence between the fixed-intercept estimate and the random-intercept estimates. I'm pretty sure that, if anything, this would make my p-values anti-conservative, so the fact that the Type 1 error rate is less than alpha (when the true mean is close to zero) is still noteworthy.
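
Concretely, under these uncritical assumptions, per-group two-sided p-values could be sketched as follows, reusing my.model, cV, and BLUPsd from the simulation code above (this is just the normal-approximation calculation described in the preceding paragraph, not an established lme4 interface):

est <- cV$group + fixef(my.model)   # conditional mean of each group effect plus the fixed intercept
z <- est / BLUPsd                   # approximate z-scores
p <- 2 * pnorm(-abs(z))             # two-sided p-values for H0: the group's effect is zero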

Best Answer

If one of your model p-values has the meaning "probability of observing a parameter estimate that is as or more extreme than the one estimated, assuming the null hypothesis is true" (i.e. the standard interpretation of p-values), then you need to adjust for multiple comparisons.

This is because the meaning of "the probability of rejecting the null hypothesis when the null hypothesis is true" (given some preferred Type I error rate $\alpha$) is no longer coherent when there is no longer "the null hypothesis," but rather more than one null hypothesis.

That said: go with false discovery rate methods, almost certainly the Benjamini-Hochberg procedure rather than Benjamini-Yekutieli (a violation of positive dependence is difficult to conceive in a regression setting), because they (1) are not sensitive to, and do not require, a conceptually undefined "family", and (2) adapt to new information about the probability that a null hypothesis is true (i.e. if you have already rejected some number of tests, you should no longer believe that 100% of your remaining null hypotheses are true).
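
For what it's worth, given a vector p of per-group p-values (for instance the ones sketched in the question above), the Benjamini-Hochberg adjustment is available directly in base R; groups whose adjusted p-value falls below the chosen FDR level are the discoveries:

p.adj <- p.adjust(p, method = "BH")   # Benjamini-Hochberg adjustment
discoveries <- which(p.adj < 0.05)    # groups declared nonzero at a 5% false discovery rate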