I perform a lot of bioassays in which I score mortality not on individuals, but on groups of individuals as a proportion (the denominator, i.e., number of trials, is known):
Sample Data:
library(tidyverse)
library(betareg)
dataset <- data.frame(treatment = rep(c("a", "b", "c", "d"), each = 5),
                      success = c(6,6,9,6,5,10,10,9,10,10,6,8,9,9,10,7,8,6,8,7),
                      fail = c(4,3,1,4,5,0,0,0,0,0,4,2,1,1,0,3,2,4,2,3)) %>%
  mutate(n = success + fail,
         treatment = factor(treatment, levels = c("a", "b", "c", "d")))
Generally I would run this as a GLM with a quasibinomial family, but in some instances (like the one provided) all replications in one treatment were 0% or 100% mortality. In that situation the Std. Error and CI estimates are extremely large for that treatment. I assume this is something similar (or identical) to complete separation, although I have only ever seen complete separation discussed for binomial response data consisting of 0's and 1's.
Regular GLM:
data.list <- list(success.fail = cbind(dataset$success, dataset$fail), treatment = dataset$treatment)
binary.glm <- glm(success.fail ~ treatment, data = data.list, family = quasibinomial)
summary(binary.glm)
> summary(binary.glm)
Call:
glm(formula = success.fail ~ treatment, family = quasibinomial,
data = data.list)
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.81457  -0.33811   0.00012   0.54942   1.86737  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.6325     0.2622   2.412   0.0282 *
treatmentb   20.4676  2890.5027   0.007   0.9944  
treatmentc    1.0257     0.4271   2.402   0.0288 *
treatmentd    0.3119     0.3801   0.821   0.4239  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 0.7634611)

    Null deviance: 43.36  on 19  degrees of freedom
Residual deviance: 13.40  on 16  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 18
I've read about dealing with separation in this post, but I haven't found a way to use Firth's penalized likelihood with proportion data like mine. Instead I turned to beta regression as an option. Is it reasonable to use my n values (number of observations) for each proportion as a weighting value in the weights argument of betareg()? I transformed my proportions to fit within the (0,1) interval using equation (4) of Smithson & Verkuilen, "A Better Lemon Squeezer? Maximum-Likelihood Regression With Beta-Distributed Dependent Variables".
Beta Regression:
lemonsqueeze.fun <- function(df, success, fail) {
  new <- df %>%
    mutate(prop.naught = {{success}}/({{success}} + {{fail}}),  # raw proportion in [0,1]
           n = {{success}} + {{fail}},                          # number of trials per observation
           y.prime = (prop.naught - 0)/(1 - 0),                 # rescale to [0,1]; identity for proportions
           # squeeze into the open interval (0,1), Smithson & Verkuilen eq. (4),
           # where length(y.prime) is the sample size
           y.doubleprime = ((y.prime*(length(y.prime) - 1)) + 0.5)/length(y.prime))
  return(new)
}
transformed.data <- lemonsqueeze.fun(dataset, success, fail)
beta.mod <- betareg(y.doubleprime ~ treatment, data = transformed.data, weights = n)
summary(beta.mod)
> summary(beta.mod)
Call:
betareg(formula = y.doubleprime ~ treatment, data = transformed.data, weights = n)
Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-6.8894 -1.9274  0.0000  0.5899  8.4217 

Coefficients (mean model with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.63875    0.07291   8.761   <2e-16 ***
treatmentb   2.30484    0.14846  15.525   <2e-16 ***
treatmentc   1.04830    0.11600   9.037   <2e-16 ***
treatmentd   0.21245    0.10408   2.041   0.0412 *  

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   15.765      1.602   9.841   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type of estimator: ML (maximum likelihood)
Log-likelihood: 229.5 on 5 Df
Pseudo R-squared: 0.7562
Number of iterations: 21 (BFGS) + 2 (Fisher scoring)
In all cases the Std. Error is much smaller, and we no longer have a crazy estimate for treatment b. Are there problems with this? Assumptions I should check? Better ways to skin this cat? Alas, I'm not a statistician, merely a biologist with barely enough statistical knowledge to be dangerous.
Best Answer
The data you have are really a classic binomial setting and you can use binomial GLMs to model the data, either using the standard maximum likelihood (ML) estimator or using a bias-reduced (BR) estimator (Firth's penalized estimator). I wouldn't use beta regression here.
For treatment b you have 49 successes and 0 failures. Hence there is quasi-complete separation, leading to an infinite ML estimate, while the BR estimate is guaranteed to be finite. Note that the BR estimator provides a principled solution to the separation, whereas trimming the proportion data is rather ad hoc.
For fitting the binomial GLM with ML and BR you can use the glm() function and combine it with the brglm2 package (Kosmidis & Firth, 2020, Biometrika, doi:10.1093/biomet/asaa052). An easier way to specify the binomial response is to use a matrix where the first column contains the successes and the second column the failures. The corresponding BR estimator can be obtained by using the "brglmFit" method (provided in brglm2) instead of the default "glm.fit" method. Note that the coefficient estimates, standard errors, and test statistics for the non-separated treatments change only very slightly; the main difference is the finite estimate for treatment b.
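The original code blocks did not survive here, so the following is a sketch of the two fits described above. It assumes the `dataset` data frame from the question and an installed brglm2 package; the object names `binary.ml` and `binary.br` are illustrative.

```r
library(brglm2)

## ML fit: matrix response, first column successes, second column failures
binary.ml <- glm(cbind(success, fail) ~ treatment, data = dataset,
                 family = binomial)

## BR fit: identical call, but with the "brglmFit" method from brglm2
binary.br <- glm(cbind(success, fail) ~ treatment, data = dataset,
                 family = binomial, method = "brglmFit")

summary(binary.ml)  # treatmentb estimate diverges, with a huge standard error
summary(binary.br)  # treatmentb estimate is finite
```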
Because the BR adjustment is so slight, you can still employ the usual inference (Wald tests, likelihood ratio tests, etc.) and information criteria. In R with brglm2 this is particularly easy because the BR model fitted above still inherits from glm. As an illustration, we can assess the overall significance of the treatment factor, or look at all pairwise contrasts (aka Tukey contrasts) of the four treatments.
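The code for those two checks was also lost; a sketch under the same assumptions (the `dataset` data frame from the question, brglm2 and multcomp installed; the BR model is refit here so the snippet stands alone, and since the fit is penalized the likelihood-ratio comparison is approximate):

```r
library(brglm2)
library(multcomp)

binary.br <- glm(cbind(success, fail) ~ treatment, data = dataset,
                 family = binomial, method = "brglmFit")

## Overall significance of treatment: compare against the intercept-only model
binary.br0 <- update(binary.br, . ~ 1)
anova(binary.br0, binary.br, test = "Chisq")

## All pairwise (Tukey) contrasts between the four treatments;
## glht() works here because the fitted object inherits from glm
summary(glht(binary.br, linfct = mcp(treatment = "Tukey")))
```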
Finally, for the standard ML estimator you can also use the quasibinomial family as you did in your post, although in this data set you don't have overdispersion (rather a little underdispersion). The brglm2 package does not support quasi-families at the moment, but given that there is no evidence of overdispersion, I would not be concerned about that here.
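For reference, the dispersion estimate can be read straight off the quasibinomial fit (again assuming the `dataset` data frame from the question; `binary.quasi` is an illustrative name). A value below 1, as in the summary output shown in the question, indicates mild underdispersion rather than overdispersion:

```r
binary.quasi <- glm(cbind(success, fail) ~ treatment, data = dataset,
                    family = quasibinomial)
summary(binary.quasi)$dispersion  # about 0.76 for these data, i.e. < 1
```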