Proportion data with number of trials known (and separation?): GLM or beta regression

beta-regression, binomial-distribution, generalized-linear-model, r, separation

I perform a lot of bioassays in which I score mortality not on individuals, but on groups of individuals as a proportion (the denominator, i.e., number of trials, is known):

Sample Data:

library(tidyverse)
library(betareg)
dataset <- data.frame(treatment = rep(c("a", "b", "c", "d"), each = 5),
                      success = c(6,6,9,6,5,10,10,9,10,10,6,8,9,9,10,7,8,6,8,7),
                      fail = c(4,3,1,4,5,0,0,0,0,0,4,2,1,1,0,3,2,4,2,3)) %>%
  mutate(n = success+fail,
         treatment = factor(treatment, levels = c("a", "b", "c", "d")))

Generally I would run this as a GLM with a quasibinomial family, but in some instances (like the one provided) all replications in one treatment are 0% or 100%. In that situation the Std. Error and CI estimates for that treatment are extremely large. I assume this is something similar (or identical) to complete separation, although I have only ever seen complete separation discussed for binary response data consisting of 0's and 1's.

Regular GLM:

data.list <- list(success.fail = cbind(dataset$success, dataset$fail), treatment = dataset$treatment)
binary.glm <- glm(success.fail ~ treatment, data = data.list, family = quasibinomial)
summary(binary.glm)

> summary(binary.glm)

Call:
glm(formula = success.fail ~ treatment, family = quasibinomial, 
    data = data.list)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.81457  -0.33811   0.00012   0.54942   1.86737  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.6325     0.2622   2.412   0.0282 *
treatmentb    20.4676  2890.5027   0.007   0.9944  
treatmentc     1.0257     0.4271   2.402   0.0288 *
treatmentd     0.3119     0.3801   0.821   0.4239  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 0.7634611)

    Null deviance: 43.36  on 19  degrees of freedom
Residual deviance: 13.40  on 16  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 18

I've read about dealing with separation in this post, but I haven't found a way to use Firth's penalized likelihood with proportion data like mine. Instead I turned to beta regression as an option. Is it reasonable to use my n values (the number of trials behind each proportion) as the weights argument of betareg()? I transformed my proportions to fit within the (0,1) interval using equation (4) of Smithson & Verkuilen, "A Better Lemon Squeezer? Maximum-Likelihood Regression With Beta-Distributed Dependent Variables".
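
For reference, equation (4) first rescales the response to the unit interval, y' = (y - a)/(b - a) (a no-op here, since proportions already lie in [0, 1]), and then squeezes it into the open interval via y'' = (y'*(N - 1) + 0.5)/N, where N is the sample size. The function below implements exactly this.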

Beta Regression:

lemonsqueeze.fun <- function(df, success, fail) {
  new <- df %>%
    mutate(prop.naught = {{success}}/({{success}}+{{fail}}),  # observed proportion
           n = {{success}}+{{fail}},                          # trials per observation
           y.prime = (prop.naught-0)/(1-0),                   # rescale to [0,1] (a no-op for proportions)
           # squeeze into the open interval (0,1); length(y.prime) is the sample size N
           y.doubleprime = ((y.prime*(length(y.prime)-1))+0.5)/length(y.prime))
  return(new)
}

transformed.data <- lemonsqueeze.fun(dataset, success, fail)
beta.mod <- betareg(y.doubleprime ~ treatment, data = transformed.data, weights = n)
summary(beta.mod)

> summary(beta.mod)

Call:
betareg(formula = y.doubleprime ~ treatment, data = transformed.data, weights = n)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-6.8894 -1.9274  0.0000  0.5899  8.4217 

Coefficients (mean model with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.63875    0.07291   8.761   <2e-16 ***
treatmentb   2.30484    0.14846  15.525   <2e-16 ***
treatmentc   1.04830    0.11600   9.037   <2e-16 ***
treatmentd   0.21245    0.10408   2.041   0.0412 *  

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   15.765      1.602   9.841   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 229.5 on 5 Df
Pseudo R-squared: 0.7562
Number of iterations: 21 (BFGS) + 2 (Fisher scoring) 

In all cases the Std. Error is much smaller and we no longer have a crazy estimate for treatment b. Are there problems with this? Assumptions I should check? Better ways to skin this cat? Alas, I'm not a statistician, merely a biologist with barely enough statistical knowledge to be dangerous.

Best Answer

The data you have are really a classic binomial setting, and you can use binomial GLMs to model them, either with the standard maximum-likelihood (ML) estimator or with a bias-reduced (BR) estimator (Firth's penalized estimator). I wouldn't use beta regression here.

For treatment b you have 49 successes and 0 failures.

subset(dataset, treatment == "b")
##    treatment success fail  n
## 6          b      10    0 10
## 7          b      10    0 10
## 8          b       9    0  9
## 9          b      10    0 10
## 10         b      10    0 10

Hence there is quasi-complete separation, which leads to a divergent (infinite) ML estimate, while the BR estimator is guaranteed to be finite. Note that the BR estimator provides a principled solution to the separation, while trimming the proportion data is rather ad hoc.
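
One quick way to see that the ML estimate is diverging rather than merely large (a small sketch, not part of the original analysis): refit the model with a much tighter convergence tolerance. The deviance barely moves, but the coefficient for treatment b climbs further towards infinity.

fit.default <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial)
## Tighter tolerance: more Fisher scoring iterations, essentially the same
## deviance, but an even larger coefficient for the separated treatment
fit.tight <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial,
                 control = glm.control(epsilon = 1e-14, maxit = 100))
c(coef(fit.default)["treatmentb"], coef(fit.tight)["treatmentb"])
c(deviance(fit.default), deviance(fit.tight))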

For fitting the binomial GLM with ML and BR you can use the glm() function and combine it with the brglm2 package (Kosmidis & Firth, 2020, Biometrika, doi:10.1093/biomet/asaa052). An easier way to specify the binomial response is to use a matrix where the first column contains the successes and the second column the failures:

ml <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial)
summary(ml)
## Call:
## glm(formula = cbind(success, fail) ~ treatment, family = binomial, 
##     data = dataset)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.81457  -0.33811   0.00012   0.54942   1.86737  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    0.6325     0.3001   2.108   0.0351 *
## treatmentb    20.4676  3308.1100   0.006   0.9951  
## treatmentc     1.0257     0.4888   2.099   0.0359 *
## treatmentd     0.3119     0.4351   0.717   0.4734  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.36  on 19  degrees of freedom
## Residual deviance: 13.40  on 16  degrees of freedom
## AIC: 56.022
## 
## Number of Fisher Scoring iterations: 18

The corresponding BR estimator can be obtained by using the "brglmFit" method (provided in brglm2) instead of the default "glm.fit" method:

library("brglm2")
br <- glm(cbind(success, fail) ~ treatment, data = dataset, family = binomial,
  method = "brglmFit")
summary(br)
## Call:
## glm(formula = cbind(success, fail) ~ treatment, family = binomial, 
##     data = dataset, method = "brglmFit")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7498  -0.2890   0.4368   0.6030   1.9096  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.6190     0.2995   2.067  0.03875 * 
## treatmentb    3.9761     1.4667   2.711  0.00671 **
## treatmentc    0.9904     0.4834   2.049  0.04049 * 
## treatmentd    0.3041     0.4336   0.701  0.48304   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.362  on 19  degrees of freedom
## Residual deviance: 14.407  on 16  degrees of freedom
## AIC:  57.029
## 
## Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
## Number of Fisher Scoring iterations: 1

Note that the coefficient estimates, standard errors, and z statistics for the non-separated treatments change only very slightly. The main difference is the finite estimate for treatment b.
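
A compact way to see this side by side (a small convenience sketch reusing the ml and br fits from above):

## ML vs. BR coefficients; only the separated treatment b differs markedly
round(cbind(ML = coef(ml), BR = coef(br)), 4)
##                  ML     BR
## (Intercept)  0.6325 0.6190
## treatmentb  20.4676 3.9761
## treatmentc   1.0257 0.9904
## treatmentd   0.3119 0.3041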

Because the BR adjustment is so slight, you can still employ the usual inference (Wald tests, likelihood ratio tests, etc.) and information criteria. In R with brglm2 this is particularly easy because the br model fitted above still inherits from glm. As an illustration, we can assess the overall significance of the treatment factor:

br0 <- update(br, . ~ 1)
AIC(br0, br)
##     df      AIC
## br0  1 79.98449
## br   4 57.02927
library("lmtest")
lrtest(br0, br)
## Likelihood ratio test
## 
## Model 1: cbind(success, fail) ~ 1
## Model 2: cbind(success, fail) ~ treatment
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   1 -38.992                         
## 2   4 -24.515  3 28.955  2.288e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Or we can look at all pairwise contrasts (aka Tukey contrasts) of the four treatments:

library("multomp")
summary(glht(br, linfct = mcp(treatment = "Tukey")))
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## Fit: glm(formula = cbind(success, fail) ~ treatment, family = binomial, 
##     data = dataset, method = "brglmFit")
## 
## Linear Hypotheses:
##            Estimate Std. Error z value Pr(>|z|)  
## b - a == 0   3.9761     1.4667   2.711   0.0286 *
## c - a == 0   0.9904     0.4834   2.049   0.1513  
## d - a == 0   0.3041     0.4336   0.701   0.8864  
## c - b == 0  -2.9857     1.4851  -2.010   0.1641  
## d - b == 0  -3.6720     1.4696  -2.499   0.0517 .
## d - c == 0  -0.6863     0.4922  -1.394   0.4743  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Finally, for the standard ML estimator you can also use the quasibinomial family as you did in your post, although this data set shows no overdispersion (if anything, a little underdispersion). The brglm2 package does not support quasi-families at the moment, but given that there is no evidence of overdispersion I would not be concerned about that here.
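
If you want to check the dispersion explicitly (a quick sketch; the value matches the quasibinomial fit in the question):

## Estimated dispersion from the quasibinomial fit; a value below 1 indicates
## mild underdispersion, so the binomial standard errors are, if anything, conservative
quasi <- glm(cbind(success, fail) ~ treatment, data = dataset, family = quasibinomial)
summary(quasi)$dispersion
## [1] 0.7634611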
