Solved – Correct use of Negative Binomial with a Geometric distribution in a mixed model (glmmPQL)

geometric-distributionglmmmultiple regressionnegative-binomial-distributionr

I am trying to fit a NB GLMM with a geometric distribution. I have come across very little information on this form of regression. And would like some pointers/reassurance.

some literature is available for glm method here: http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf

but I cannot find anything on a mixed model to use with this.

main points:

Is this code correct to run such a model?
Is this the best package for this analysis?
Does model selection and validation follow that for regular Negative binomial models?

sample data:

DF <- structure(list(code = structure(c(1L, 1L, 6L, 6L, 7L, 9L, 10L, 
10L, 10L, 10L, 10L, 11L, 11L, 12L, 13L, 14L, 14L, 16L, 16L, 17L, 
17L, 23L, 24L, 26L, 27L, 27L, 27L, 28L, 28L, 29L, 30L, 30L, 31L, 
32L, 34L, 35L, 8L, 8L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 15L, 33L, 33L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
19L, 19L, 18L, 5L, 5L, 4L, 4L, 22L, 21L, 21L, 20L, 20L), .Label = c("15010212", 
"15010213", "15010215", "15010216", "15010220", "15010222", "15010245", 
"15010269", "15010274", "15010284", "15010285", "15010287", "15010290", 
"15010291", "15010292", "15010294", "15010299", "15020313", "15020314", 
"15020315", "15020316", "15020317", "15020326", "15020345", "15020348", 
"15020384", "15020395", "15020396", "15020397", "15030312", "15030317", 
"15030349", "15030392", "15030394", "15030395"), class = "factor"), 
flow = c(15.97766667, 14.226, 17.15724762, 14.7465, 39.579, 
23.355, 110.2926923, 71.95709524, 50.283, 66.66754955, 38.9218, 
72.73666667, 32.37466667, 50.34905172, 27.98471429, 49.244, 
109.1759778, 77.71733333, 37.446875, 101.23875, 67.78534615, 
21.359, 36.54257143, 34.13961111, 64.35253333, 80.98554545, 
68.0554, 61.50857143, 48.983, 63.81072727, 26.105, 46.783, 
23.0605, 33.61557143, 46.31042857, 62.37061905, 12.565, 42.31983721, 
15.3982, 14.49625, 16.40853846, 17.84350847, 14.625375, 13.10714286, 
13.35466667, 12.94033333, 13.54236364, 14.10023711, 12.5747807, 
23.77425, 25.626, 15.23888523, 74.62485714, 170.1547778, 
91.292, 71.422, 42.50887568, 53.89983761, 141.7211667, 50.67125, 
48.098, 66.83644444, 76.564875, 80.63189189, 136.0573243, 
136.3484, 86.68688889, 34.82169565, 70.00415385, 64.67233333, 
81.72766667, 57.74522034), success = c(0L, 1L, 0L, 1L, 1L, 
1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 
1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 
1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 
0L, 1L, 1L, 0L, 1L, 0L, 1L), length = c(595, 595, 582, 582, 
565, 537, 585, 585, 585, 585, 585, 595, 595, 607, 625, 627, 
627, 607, 607, 644, 644, 620, 560, 567, 615, 615, 615, 595, 
595, 546, 580, 580, 594, 605, 610, 640, 575, 575, 632, 632, 
632, 632, 632, 632, 632, 632, 632, 632, 632, 525, 585, 585, 
624, 624, 624, 624, 624, 624, 624, 608, 635, 635, 655, 670, 
670, 584, 584, 707, 680, 680, 740, 740), attempt = structure(c(1L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 1L, 1L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 11L, 1L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 
1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"), class = "factor")), .Names =         c("code", 
"flow", "success", "length", "attempt"), row.names = c(NA, -72L
), class = "data.frame")

model is as follows. setting theta = 1 should determine a geometric distribution.

library(MASS)
M1<-glmmPQL(success ~ length + flow + attempt,
          random = ~ 1|code,
          family = negative.binomial(theta = 1),
          data = DF)

summary(M1)

Ultimately I am trying predict success (0 = fail, 1 = success). However this is measured for many different individuals (code), essentially a repeated measure and hence should be included as a random effect. each individual may only have one success but can have multiple attempts. Predictors of success come in the form of "length" of the individual, "attempt" number… a factor of the number of the attempt, and "flow" which is river flow at time of the attempt and so a continuous variable.

Best Answer

Because your outcome variable "success" is a binary variable and you do want to predict success, I would use mixed-effects logistic regression. You can keep using glmmPQL or glmer in the lme4 package. This post has a great tutorial.

If you want to model your outcome variable as a geometric distributed variable as you originally did, then your outcome needs to be the number of attempts until the first success: max(attempt).