Solved – How to extract predicted probabilities from glmer results for a logistic mixed effects model

lme4-nlmelogisticmixed modelr

I have two groups that I follow over 4 time points (Baseline, Three months, Six months, and Year). The outcome is some binary variable, lets say presence or absence of cancer.

model <- glmer(Binary ~ Group + Time + Group:Time + (1 | Person), 
               data = test, family = binomial(), nAGQ = 10, 
               control = glmerControl(optimizer = "bobyqa"))

When I exponeniate the fixed effects log odds (with CIs), I get the following:

se <- sqrt(diag(vcov(model)))
output <- cbind(Estimate = fixef(model), LL = fixef(model) - 1.96 * se, 
                UL =  fixef(model) + 1.96 * se)
exp(output)
                      Estimate         LL         UL
(Intercept)              0.0293727 0.01996715 0.04320876
Group1                   3.8690905 2.47674209 6.04417441
TimeSixMos               1.7443529 1.13560801 2.67941654
TimeThreeMos             2.1692536 1.45871709 3.22589007
TimeYear                 3.1738210 2.06651131 4.87446620
Group1:TimeSixMos         0.7220189 0.38374473 1.35848475
Group1:TimeThreeMos       0.8459276 0.48017633 1.49027247
Group1:TimeYear           0.7085735 0.37362965 1.34378115

The probability from odds is odds / (1 + odds), but how can you calculate the predicted probability (of presence of cancer) for each group at each time point from this output? More so, how do you find the predicted probability for a mixed logistic model that uses categorical covariates?

Best Answer

You can have a look at the emmeans package that streamlines these calculations. But if you want to see how you could do it on your own, you could try something along these lines

DF <- with(test, expand.grid(Group = levels(Group), Time = levels(Time)))
X <- model.matrix(~ Group + Time + Group:Time, data = DF)
DF$probs <- plogis(c(X %*% fixef(model)))
DF

However, note that these probabilities are conditional on the random effects and will not match with the population-averaged probabilities at the same combinations of Group and Time; for more on this topic, check here.

Related Solutions

Solved – Results of Type-3 Wald Chi-Square Different for GLMM with Different Contrast Coding

You have to be a little careful with car:Anova(..., type=3) since it only performs Type III tests if the sum-to-zero constrast coding (contr.sum) is used and if there are no "missing cells" or other estimability issues (to borrow a term from linear model theory).

The contrasts that car:Anova(..., type=3) produce is what I would call marginal contrasts and also what you get from nlme::anova.lme(..., type="marginal"). In "nice" cases these contrasts correspond to Type III contrasts if contr.sum is used.

My advice is therefore to:

Look at your data to check if there are any missing cells (use table(group, week)). Estimability issues often show up as NA in summary coefficient tables if you use lm but glmer may not display them. From the looks of your model summary you are probably ok.
Set the contrast-coding to contr.sum, fit the model, obtain the Type III tests:

options(contrasts = c("contr.sum", "contr.poly")) drugMod <- glmer(drugCrime ~ group*week + (1|id), ....) car::Anova(drugMod, type=3)
It is possible to obtain the Type III tests using contrasts of Least-Square Means (or in this case their GLM generalisation). Russell Lenth has added extensive guides to his emmeans package and this vignette describes how to obtain the type III tests in this way. I would compare these with the results from car::Anova. Note that this approach will also not yield Type III if there are missing cells or estimability issues, but emmeans warns you if it detects such problems (and I think this is also implemented for glmer fits).

This very recommendable vignette has a very insightful discussion of Type III contrasts. It termed the approach in car:Anova(..., type=3) Not-Safe-Type-Three and it mentions that the instruction to use contr.sum is in the car-book (see footnote on page 15). It also has a description of which hypotheses are actually being tested if you use contr.treatment or contr.SAS and why these are probably not hypotheses that you are actually interested in (this addresses the "will not make sense" quote you mention).

Edit in order to answer: "what is 'missing cells' and how to detect them?"

As an example of a real data set with missing cells, consider the soup data from the ordinal package. Note the 0-entry in the following tabulation:

data("soup", package="ordinal")
soup$RESPONSE <- as.numeric(as.character(soup$SURENESS))
with(soup, table(DAY, PRODID)) 
   PRODID
DAY   1   2   3   4   5   6
  1 369  94 184  93  88  93
  2 370 275   0  92  97  92

If we fit a model with the interaction between PRODID and DAY, one coefficient is not estimable which shows up as NA:

fm <- lm(RESPONSE ~ DAY * PRODID, data=soup)
summary(fm)

Call:
lm(formula = RESPONSE ~ DAY * PRODID, data = soup)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2043 -1.4486  0.8043  1.3818  2.5514 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.94580    0.09421  41.883  < 2e-16 ***
DAY2         -0.49715    0.13314  -3.734 0.000194 ***
PRODID2       0.67122    0.20908   3.210 0.001349 ** 
PRODID3       1.11398    0.16332   6.821 1.23e-11 ***
PRODID4       0.59183    0.20998   2.819 0.004876 ** 
PRODID5       0.92920    0.21469   4.328 1.58e-05 ***
PRODID6       1.25850    0.20998   5.993 2.47e-09 ***
DAY2:PRODID2  0.49831    0.25392   1.962 0.049861 *  
DAY2:PRODID3       NA         NA      NA       NA    
DAY2:PRODID4  0.51386    0.29756   1.727 0.084347 .  
DAY2:PRODID5  0.62215    0.29784   2.089 0.036854 *  
DAY2:PRODID6  0.48850    0.29756   1.642 0.100823    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.81 on 1836 degrees of freedom
Multiple R-squared:  0.1022,    Adjusted R-squared:  0.09735 
F-statistic: 20.91 on 10 and 1836 DF,  p-value: < 2.2e-16

In general any source of non-estimability is a problem but R goes a long way to eliminate such problems in the way design matrices are constructed. I can think of two relevant sources of non-estimability: missing cells and perfectly correlated covariates and most people already automatically avoid the latter.

If there were no missing cells the Type III test of DAY would test the difference between days at an average PRODID (i.e., a simple, flat, unweighted average). This is exactly the Yates contrast that Therneau discusses, but in the presence of missing cells this straight-forward interpretation no longer holds. You might want to stay clear of Type III tests in such cases due to lack of interpretability...

Logistic Regression in R – How to Obtain Profile Likelihood 95% Confidence Interval?

My collaborator Rafael de Andrade Moral has written a R function that allows to estimate the uncertainty for L50 estimates according to 13 different methods, including the profile-likelihood method. We used this R function in our recently-published paper in Fisheries Research:

Monitoring reproduction in fish: Assessing the adequacy of ogives and the predicted uncertainty of their L50 estimates for more reliable biological inferences

to evaluate the coverage probability of different approaches, such as the Delta method, non-parametric bootstrapping and Fieller (1944)'s analytical approach. The profile-likehood method did pretty good but the Monte Carlo approach with BCa interval led to the best nominal coverage for the simulated dataset that we have analyzed.

The R scripts for the profile-likelihood method (and others) to assess uncertainty in the L50 (or else such as the L95) is available at:

https://github.com/rafamoral/L50/blob/main/confint_L.R

The supplementary material of our MS contains information on how to use the confint_L() function.

Cheers,

Best Answer

Related Solutions

Solved – Results of Type-3 Wald Chi-Square Different for GLMM with Different Contrast Coding

Edit in order to answer: "what is 'missing cells' and how to detect them?"

Logistic Regression in R – How to Obtain Profile Likelihood 95% Confidence Interval?

Related Question