Solved – Overdispersion and modeling alternatives in Poisson random effect models with offsets

generalized-linear-model, glmm, negative-binomial-distribution, overdispersion, poisson-regression

I have run into a number of practical questions when modeling count data from a within-subject experiment. I briefly describe the experiment, the data, and what I have done so far, followed by my questions.

Four different movies were shown to a sample of respondents in sequence. After each movie an interview was conducted, in which we counted the number of occurrences of certain statements of interest for the research question (predicted count variable). We also recorded the maximum possible number of occurrences (coding units; offset variable). In addition, several features of the movies were measured on a continuous scale; for one of these we have a causal hypothesis that the movie feature affects the count of statements, while the others serve as controls (predictors).

The modeling strategy adopted so far is as follows:

Estimate a random effect Poisson model, where the causal variable is used as a covariate and the other variables as control covariates. The model has an offset equal to `log(units)` (coding units). Random effects are taken across subjects (movie-specific counts are nested within subjects). We find the causal hypothesis confirmed (significant coefficient of the causal variable). For estimation we used the lme4 package in R, in particular the function glmer.
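For concreteness, the fit described above might look like the following sketch; all variable names (`count`, `units`, `cause`, `ctrl1`, `ctrl2`, `subject`, data frame `movies`) are hypothetical stand-ins, not the original data:

```r
# Random-intercept Poisson model with an exposure offset, as described above.
# 'count' = statement counts, 'units' = coding units (exposure),
# 'cause' = causal movie feature, 'ctrl1'/'ctrl2' = control features,
# 'subject' = respondent ID. All names are placeholders.
library(lme4)

m_pois <- glmer(
  count ~ cause + ctrl1 + ctrl2 + offset(log(units)) + (1 | subject),
  data   = movies,
  family = poisson(link = "log")
)
summary(m_pois)  # the coefficient of 'cause' tests the causal hypothesis
```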

Now I have the following questions. A common problem in Poisson regression is overdispersion. I know that this can be tested by fitting a negative binomial regression and evaluating whether its dispersion parameter improves the fit over a simple Poisson model. However, I do not know how to do so in a random effect context.

  • How should I test for overdispersion in my situation? I tested overdispersion in a simple Poisson/negative binomial regression (without random effects) that I know how to fit. The test suggests the presence of overdispersion. However, since these models do not take the clustering into account, I suppose this test is incorrect. I am also not sure about the role of the offset in tests of overdispersion.
  • Is there something like a negative binomial random effect regression model, and how should I fit it in R?
  • Do you have suggestions for alternative models that I should try on the data, i.e. taking the repeated measures structure, count variables and exposure (coding units) into account?
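As a sketch of what such checks can look like in R, one common approximate diagnostic is the ratio of the sum of squared Pearson residuals to the residual degrees of freedom, and lme4 provides `glmer.nb` as a negative binomial counterpart to `glmer`. This assumes the Poisson fit is stored as a hypothetical object `m_pois` with the placeholder variable names used earlier; it is an illustration, not a definitive test:

```r
# Approximate overdispersion check for a fitted glmer Poisson model:
# under the Poisson assumption this ratio should be near 1; values well
# above 1 suggest overdispersion. 'm_pois' is the (hypothetical) fit above.
library(lme4)

rp    <- residuals(m_pois, type = "pearson")
ratio <- sum(rp^2) / df.residual(m_pois)
ratio

# Negative binomial random effect model via lme4's glmer.nb, compared to
# the Poisson fit by information criteria:
m_nb <- glmer.nb(
  count ~ cause + ctrl1 + ctrl2 + offset(log(units)) + (1 | subject),
  data = movies
)
AIC(m_pois, m_nb)
```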

Best Answer

There is a maximum possible number of counted answers, related to the number of questions asked. Although one can model the data as Poisson counts, a Poisson variable has no theoretical upper limit; its support is $[0,\infty)$. A discrete distribution with finite support, e.g., the beta-binomial, might therefore be more appropriate, as it has a more flexible shape. However, that is just a guess, and, in practice, I would search for an answer to a more general question using brute force...
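If one wanted to try the beta-binomial idea with the repeated measures structure retained, a sketch (using the glmmTMB package, which the answer does not mention, and the same hypothetical variable names as in the question) could look like this; the response is modelled as successes out of `units` trials, which respects the finite upper bound:

```r
# Beta-binomial random-intercept model via glmmTMB (an assumption of this
# sketch, not part of the original answer). 'count' successes out of
# 'units' coding units; variable names are placeholders.
library(glmmTMB)

m_bb <- glmmTMB(
  cbind(count, units - count) ~ cause + ctrl1 + ctrl2 + (1 | subject),
  data   = movies,
  family = betabinomial(link = "logit")
)
summary(m_bb)
```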

Rather than check for overdispersion, which has no guarantee of leading to a useful answer (although one can examine indices of dispersion to quantify it), I would suggest searching for a best-fitting distribution with a discrete-distribution option of a fit-quality search program, e.g., Mathematica's FindDistribution routine. That type of search does a fairly exhaustive job of guessing which known distribution(s) work(s) best, not only to mitigate overdispersion but also to model many other data characteristics more usefully, e.g., goodness of fit as measured in a dozen different ways.
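Staying in R rather than Mathematica, a rough analogue of this distribution-search idea is to fit several candidate discrete distributions with the fitdistrplus package and compare goodness-of-fit statistics; this is a swapped-in sketch, not the FindDistribution routine the answer names, and `movies$count` is a hypothetical count vector:

```r
# Fit a few candidate discrete distributions to the raw counts and compare
# goodness of fit (chi-square statistics, AIC/BIC) side by side.
library(fitdistrplus)

f_pois <- fitdist(movies$count, "pois")
f_nb   <- fitdist(movies$count, "nbinom")
f_geom <- fitdist(movies$count, "geom")

gofstat(list(f_pois, f_nb, f_geom),
        fitnames = c("Poisson", "NegBinom", "Geometric"))
```

Note that this ignores the covariates and clustering, so it is only a screening step for the marginal distribution, in the spirit of the brute-force search described above.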

To examine my candidate distributions further, I would post hoc examine residuals to check for homoscedasticity and/or distribution type, and also consider whether the candidate distributions correspond to a physical explanation of the data. The danger of this procedure is identifying a distribution that is inconsistent with the best modelling of an expanded data set. The danger of not doing a post hoc procedure is to assign an arbitrarily chosen distribution a priori without proper testing (garbage in, garbage out). The strength of the post hoc approach is that it limits the errors of fitting; that is also its weakness, i.e., it may understate the modelling errors through pure chance, since many distribution fits are attempted. That, then, is the reason for examining residuals and considering physicality. The top-down or a priori approach offers no such post hoc check on reasonableness: the only way to compare the physicality of models with different distributions is to compare them post hoc. Hence the nature of physical theory: we test a hypothetical explanation of the data with many experiments before we accept it as exhausting the alternative explanations.
