Proportion/Rate data and zero-inflation (two counts)

beta distribution, generalized linear model, lme4-nlme, regression

I have an experimental dataset which makes use of two counts. In vague terms, we are studying animals as they become behaviourally inactive and then apply a stimulus once an hour. One of the counts is the number of times the animal reacted to the stimulus (IF it was inactive). The second count records how many times the animal was stimulated in each bin of prior inactivity duration.

The idea is that animals which were inactive for a longer time respond less. We want to see how this changes depending on certain experimental manipulations.

e.g.

Animal | Inactive Bin  | Histogram Count | Reaction Count | Note
1      | 0-10 minutes  | 5               | 2              | more responsive
2      | 11-20 minutes | 2               | 0              | unresponsive
3      | 0-10 minutes  | 0               | 0              | active at the time of stimulus
4      | 11-20 minutes | 0               | 0              | active at the time of stimulus

If a proportion is calculated, we get a variable called 'reaction proportion' (Reaction Count / Histogram Count). However, the data (count or proportion) contains a large number of zeros, among them the animals which were active at the time of the stimulation.

I am interested in the zero-inflation aspect of the data as well as the positive cases. Animals that are not in an 'inactivity bin' will not have a reaction count, but those animals might still be useful to include in the statistical model. We put the animals in different experimental groups, so we want to properly model how these counts vary across groups…

I have read about zero-inflation models, but the implementations in R (like zoib, pscl) seem to take only one count variable? Maybe I have misunderstood something. zeroinfl() can't take a rate, but it looks like hurdle() might… I am not clear on the syntax for doing this…

I suppose my question is: what is the best model to use in order to fully understand this data, and what link function/family should I use? I looked into lmer() models for the proportion Reaction Count / Histogram Count, but the residuals are quite strangely distributed on account of the high number of 0s and 1s (and the dependence of Reaction Count on the value of Histogram Count). This made me think a GLM with a Poisson family (log link) or something similar might be better.

Furthermore, there is the usual proportions issue: an animal that reacts 5 times out of 5 occasions in a given bin has a reaction proportion of 100%, yet an animal could react 20 times out of 20 occasions and the proportion is still 100%… so I'm not sure a proportion is the best way to go. Yet how should I take into account the bounded relationship between the counts without calculating a proportion?

I would be happy to provide more details if necessary, such as residual plots or further information about the dataset. I would really appreciate some opinions on how to proceed.

Best Answer

A count data model is at the same time a rate (events per observation time) model if you use an offset. E.g. the number of events/reactions per time unit can be modelled with the count as the dependent variable and the log of the observation time as an offset, using a log link. Similarly, if your rate is events out of N tries, you can model that directly as binomial (y events out of N tries) using logistic regression rather than modelling the proportion. The reason it is usually better to model things this way is that more observation time or more tries carry more information; a raw rate or proportion does not reflect this, while a count or binomial model does.
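For concreteness, a minimal R sketch of both variants, assuming a data frame dat with hypothetical columns events, obs_time, reactions, trials and group (adapt the names to your data):

    ## Rate: events per unit observation time, Poisson regression
    ## with the log of the observation time as an offset
    fit_rate <- glm(events ~ group + offset(log(obs_time)),
                    family = poisson(link = "log"), data = dat)

    ## Proportion: y reactions out of N trials, modelled as binomial
    ## (logistic regression) via the two-column response
    fit_prop <- glm(cbind(reactions, trials - reactions) ~ group,
                    family = binomial(link = "logit"), data = dat)

In your setting, trials would correspond to the Histogram Count and reactions to the Reaction Count.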

Having a lot of zeros is not per se a reason to go for a zero-inflated model. The real question is whether there are more zeros than you would expect under a simpler model. The obvious models to try first are GLMs (Poisson or logistic regression), GLMMs (the same models with an added random effect), or negative binomial regression. You can fit such models in R using the glmer function in lme4 with the Poisson family and a log link or the binomial family and a logit link. Which random effects make sense depends on your experiment, which was not so clear from your description (but if you have e.g. multiple records per individual, you would probably want at least a random individual effect on the intercept, i.e. (1|individual)). The nice thing about a GLMM is that you can estimate a latent subject random effect (individuals that simply have fewer events than others), while a zero-inflated model is similar but imposes a stark dichotomy (either an individual never has events or it follows the usual distribution).
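A hedged sketch of such a GLMM, reusing the hypothetical column names from above plus an individual identifier:

    library(lme4)

    ## Binomial GLMM: reactions out of trials, fixed group effect,
    ## random intercept per individual
    fit_glmm <- glmer(cbind(reactions, trials - reactions) ~ group
                        + (1 | individual),
                      family = binomial(link = "logit"), data = dat)

    ## The Poisson GLMM with an observation-time offset is analogous:
    ## glmer(events ~ group + offset(log(obs_time)) + (1 | individual),
    ##       family = poisson, data = dat)

    ## Negative binomial with random effects is available via glmer.nb():
    ## glmer.nb(events ~ group + (1 | individual), data = dat)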

Looking at residuals for models for discrete data is challenging (see e.g. What do the residuals in a logistic regression mean?). Other options include (posterior) predictive checks. With nested models (e.g. a logistic regression GLMM vs. the same GLMM with zero inflation), you can also test whether the more complex model is needed. However, note that deciding on your model based on your data will invalidate hypothesis tests that you conduct on the same data. That is, if you want valid hypothesis tests, p-values and so on that can be claimed to be "confirmatory", you should really pre-specify your analysis before seeing the data, based on theoretical considerations and/or on previous data that allow you to judge which model will be appropriate.
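One concrete way to carry out such checks (my addition here, one option among several): the DHARMa package provides simulation-based residual diagnostics for GLM(M)s, and glmmTMB can fit the zero-inflated counterpart of the GLMM above so that the nested models can be compared. Both calls below reuse the hypothetical names from earlier:

    library(DHARMa)
    library(glmmTMB)

    ## Simulation-based (predictive-check style) residual diagnostics
    sim_res <- simulateResiduals(fittedModel = fit_glmm, plot = TRUE)
    testZeroInflation(sim_res)   # excess zeros relative to the fitted model?

    ## Nested comparison: the same binomial GLMM with and without zero inflation
    fit0   <- glmmTMB(cbind(reactions, trials - reactions) ~ group
                        + (1 | individual),
                      family = binomial, data = dat)
    fit_zi <- update(fit0, ziformula = ~ 1)
    anova(fit0, fit_zi)   # likelihood-ratio test; conservative at the boundary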