Solved – Post-hoc test for binomial GLM with some cases having probabilities of 1

binomial-distribution · generalized-linear-model · post-hoc

I am studying the effect of location and genotype on plant survival. I fitted a binomial GLM and, after finding a significant interaction, conducted a post-hoc test using the emmeans package. When survival was 100%, i.e. no mortality whatsoever, the lower and upper confidence limits extended all the way from 0 to 1, and those groups were not flagged as significantly different from the other survival probabilities:

m1 <- glm(cbind(totalalive, totaldead) ~ f1 * f2, family = binomial, data = df)
summary(m1)

#Call:
#glm(formula = cbind(totalalive, totaldead) ~ f1 * f2, family = binomial, 
#    data = df)
#
#Deviance Residuals: 
#     Min        1Q    Median        3Q       Max  
#-2.94105   0.00011   0.00011   0.78201   1.68444  
#
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept)  2.092e+01  2.371e+03   0.009    0.993
#f1B          2.002e-08  3.353e+03   0.000    1.000
#f1C         -1.655e+01  2.371e+03  -0.007    0.994
#f2V2         2.032e-08  3.353e+03   0.000    1.000
#f2V3        -1.858e+01  2.371e+03  -0.008    0.994
#f1B:f2V2    -1.928e+01  4.106e+03  -0.005    0.996
#f1C:f2V2    -2.730e+00  3.353e+03  -0.001    0.999
#f1B:f2V3     3.635e-01  3.353e+03   0.000    1.000
#f1C:f2V3     1.746e+01  2.371e+03   0.007    0.994
#
#(Dispersion parameter for binomial family taken to be 1)
#
#    Null deviance: 137.380  on 89  degrees of freedom
#Residual deviance:  80.419  on 81  degrees of freedom
#AIC: 148.99
#
#Number of Fisher Scoring iterations: 18

To conduct the post-hoc test, I used emmeans() from the emmeans package together with cld():

cld(emmeans(m1, ~f1 : f2), type="response", Letters = letters, adjust = "none")

 #f1 f2   prob           SE  df    asymp.LCL asymp.UCL .group
 #C  V2 0.8375 4.124527e-02 Inf 7.399573e-01 0.9032387  a    
 #B  V2 0.8375 4.124527e-02 Inf 7.399573e-01 0.9032387  a    
 #A  V3 0.9125 3.159188e-02 Inf 8.276478e-01 0.9577123  ab   
 #B  V3 0.9375 2.706329e-02 Inf 8.584872e-01 0.9737457  ab   
 #C  V3 0.9625 2.124081e-02 Inf 8.901011e-01 0.9878549   b   
 #C  V1 0.9875 1.242163e-02 Inf 9.166073e-01 0.9982419   b   
 #A  V1 1.0000 1.939592e-06 Inf 2.220446e-16 1.0000000  ab   
 #B  V1 1.0000 1.939592e-06 Inf 2.220446e-16 1.0000000  ab   
 #A  V2 1.0000 1.939592e-06 Inf 2.220446e-16 1.0000000  ab

And here is a plot of the results:

[Figure: estimated survival probabilities with confidence limits and significance letters]

As can be seen from the lower and upper confidence limits, the groups with 100% survival are not significantly different from any other group. I suspect the reason is that there is no variation in those groups (the standard error on the response scale is effectively zero), so it is not possible to test whether they differ significantly from the others.

Is this correct? If so, would it be:

(a) reasonable to remove the letters from the cases with 100% survival and test for significant differences only among the cases where the probability of survival is $<1$?

Or (b) should I leave the significance letters in the plot and simply explain the situation in the figure caption, as I did above?

At this point (see figure above) it may look strange to the reader that Site C / Genotype V2 differs significantly from Site C / Genotype V1, while Site B / Genotype V2 does not differ from Site B / Genotype V1.

EDIT

For clarification, I have also added a boxplot of the raw data. The number of replicates is 10 per factor combination:

[Figure: boxplot of the raw survival data]

Best Answer

Complete separation occurs in logistic (and binomial, and Poisson) regression when some categories contain 100% failures (or all-zero counts) or, in the logistic/binomial case, 100% successes. In this case the 'true' maximum-likelihood estimates are infinite, because logistic regression parameters are estimated on the logit scale, and logit(0) $\to -\infty$ while logit(1) $\to \infty$. In practice the fitting algorithm stops at extreme (but not infinite) parameter estimates, e.g. $|\hat \beta| > 8$. Worse, the usual Wald standard errors also tend to be very large, because the quadratic approximation to the log-likelihood surface on which they rely breaks down badly near separation.
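To make this concrete, here is a minimal sketch with hypothetical data: one group has all successes, so its fitted logit diverges and both the coefficient and its Wald SE come out huge, mirroring the pattern in the question's `summary(m1)` output.

```r
## Toy illustration of complete separation (hypothetical data)
set.seed(1)
toy <- data.frame(
  grp   = rep(c("sep", "ok"), each = 10),  # "sep" group: all successes
  alive = c(rep(5, 10), rbinom(10, 5, 0.8)),
  total = 5
)
fit <- glm(cbind(alive, total - alive) ~ grp, family = binomial, data = toy)
summary(fit)$coefficients
## The coefficient for the separated level is extreme, and its Wald SE
## is orders of magnitude larger than for the well-behaved contrast.
```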

There's more explanation of complete separation here.
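If you want to confirm that separation (rather than some other numerical issue) is the culprit, the detectseparation package provides a `glm` method that checks, before fitting, whether any estimates diverge to $\pm\infty$. A sketch, assuming your original variable names:

```r
## install.packages("detectseparation")
library(detectseparation)

## Same model formula as in the question; method = "detect_separation"
## reports which coefficients would be infinite instead of fitting.
glm(cbind(totalalive, totaldead) ~ f1 * f2,
    family = binomial, data = df,
    method = "detect_separation")
```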

A reasonable solution is Firth's bias-reduced logistic regression, implemented in R's brglm2 package, which can be used as a drop-in fitting method for glm():

## install.packages("brglm2")
library(brglm2)
m1 <- glm(cbind(totalalive, totaldead) ~ f1 * f2,
          family = binomial, data = df, method = "brglmFit")
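After refitting with bias reduction, the post-hoc workflow from the question should carry over unchanged, since the returned object is still of class "glm". A sketch (assuming the emmeans and multcomp packages are installed):

```r
library(emmeans)
library(multcomp)  # provides cld() for emmGrid objects

## Same compact-letter-display call as in the question, now on the
## bias-reduced fit; the separated cells get finite SEs and usable CIs.
cld(emmeans(m1, ~ f1:f2), type = "response",
    Letters = letters, adjust = "none")
```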