Should I transform a count response variable to a continuous response variable for mixed models (GLMMs)?

biostatistics, count-data, glmm, mixed model, regression

I conducted an experiment where I counted the number of infected leaves per plant. There were four replicate plants per treatment (two treatments in total). So my response variable is non-negative count data, ranging from 0 to 392. I'm going to use generalized linear mixed models to see the effect of weather variables and mulching status on disease development (number of infected leaves). My questions are:

  1. Should my response variable be number of infected leaves PER PLANT or on FOUR PLANTS (there were four plants per treatment)?

  2. Should my response variable be count data or continuous? That is, should I use the number of infected leaves per plant/per treatment (count data), or should I take an average of the number of infected leaves per plant/per treatment (question 1), and thus use a continuous response variable? The total number of leaves per plant was not counted, but an average of 3132 total leaves per plant is normally used in the literature for this plant species (Justin Brouwers boxwoods).

If you're interested in the details of the experiment: I left my potted plants out in the field for a week, took them back to the glasshouse, and counted the number of infected leaves per plant after two weeks. I had two treatments: mulched (a practice where diseased plant material left from previous years is covered with straw to prevent splash dispersal of fungal spores to plants during rain), and non-mulched (infected plant remains from previous years are not covered, so we'd expect more spores to be splashed onto plants during rain; the disease is spread only by rain).

The experiment was conducted over two years. In the first year, plants were taken out to the field for 30 weeks and in the second for 26 weeks, so 56 weeks in total. If I use the average or number of infected leaves PER TREATMENT (four plants), I'll have 56 data points. If I use the number (or average) of infected leaves PER PLANT, then I will have 56 x 4 = 224 data points. I have a total of 8 predictors (mulched, non-mulched, wind speed, wind direction, total rain, rain duration, relative humidity and temperature during each week plants were in the field). Thanks very much for any assistance.

Best Answer

Because you have count data, it is best to use a count model (Poisson, quasi-Poisson, negative binomial) rather than treating the values as continuous.* As the highest number of infected leaves is only about 12% of the typical total number of leaves, you won't have to worry about hitting an upper limit on the counts. One limitation is that you only know the number of infected leaves per plant, while what's probably of more interest is the probability of a leaf getting infected. That would involve a binomial model and require knowing the total number of leaves per plant.
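As a quick check of the "about 12%" figure, here is a minimal sketch using only the two numbers given in the question. The comparison of binomial and Poisson variances is my own illustration of why a count model is reasonable when the infected fraction stays small; it is not something fitted to your data.

```python
max_infected = 392      # largest count reported in the question
typical_leaves = 3132   # literature average quoted in the question, not measured here

# Fraction of a typical plant's leaves infected in the most extreme case
fraction = max_infected / typical_leaves
print(f"largest count is {fraction:.1%} of the typical leaf total")

# For a binomial count the variance is n*p*(1-p); for a Poisson count it is n*p.
# Their ratio is 1 - p, so the two stay close while p is small.
variance_ratio = 1 - fraction
print(f"binomial/Poisson variance ratio at that extreme: {variance_ratio:.3f}")
```

Even at the largest observed count the ratio is not far from 1, and for the much smaller counts that presumably make up most of the data it will be closer still.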

As you did not use the same plants repeatedly, think carefully about whether a mixed model is really appropriate or helpful. With data over only 2 years, it doesn't make sense to treat the calendar year as a random effect. Whether a random effect for location makes sense similarly depends on the number of locations. It's often thought best to have on the order of 6 groups (locations, here) to treat them as a random effect; see this page and this page for some discussion. You don't seem to have that many locations.

You propose to use the week as a random effect, but I suspect that (as is typical for studies over time) you will have correlations in effects from week to week that a simple random intercept for week, as you propose, wouldn't adequately handle. It can be better to model week as a continuous variable but flexibly, with a regression spline or with smoothers in a generalized additive model.

For your main question about whether to use the numbers for each plant or the sum over the 4 plants for each treatment/week/location, it's generally best to analyze data as close as possible to the original unit of measurement. That argues for using values for each plant separately. An average over 4 plants wouldn't be appropriate for a count model. Treating the plants separately will better document whether there is extra variance among plants that needs to be handled with a quasi-Poisson or negative binomial model.
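To illustrate what "extra variance among plants" looks like, here is a rough sketch of a variance-to-mean ratio computed on per-plant counts. The four counts are hypothetical, not taken from the question, and `dispersion_index` is a helper name I made up. With only 4 plants per cell the ratio is very noisy, so treat it as a descriptive first look rather than a substitute for checking dispersion in a fitted model.

```python
import statistics

def dispersion_index(counts):
    """Variance-to-mean ratio: near 1 for Poisson-like data, larger when overdispersed."""
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("mean count is zero; dispersion index is undefined")
    return statistics.variance(counts) / mean

# Hypothetical infected-leaf counts on the 4 plants of one treatment in one week
plants = [12, 45, 3, 80]
ratio = dispersion_index(plants)
print(f"variance/mean = {ratio:.1f}")
```

A ratio far above 1, as in this made-up example, would point toward a quasi-Poisson or negative binomial model. Note that this information is lost once the 4 plants are summed or averaged.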

You should recognize that the main sample-size issue in count-based data is the number of counts rather than the number of individuals from which they were obtained. For example, if you did a binomial model with data on one leaf at a time and a probability less than 50% of infection per leaf, then the limiting factor would be the total number of infected leaves. And if the data are truly Poisson with the same distribution for each plant having the same treatment/week/location, the sum over 4 such plants would also be Poisson.
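The last point, that a sum of independent Poisson counts with the same mean is again Poisson, can be checked by simulation. This is a minimal sketch with an assumed per-plant mean of 5 (a hypothetical value) and a hand-written sampler, since Python's standard library has no Poisson generator.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(1)   # fixed seed so the run is reproducible
lam = 5.0                # hypothetical mean infected leaves per plant
n_sim = 20000

# Total over 4 independent plants sharing the same treatment/week/location
totals = [sum(poisson_draw(rng, lam) for _ in range(4)) for _ in range(n_sim)]
mean_total = statistics.mean(totals)
var_total = statistics.variance(totals)
print(f"mean of totals = {mean_total:.2f}, variance of totals = {var_total:.2f}")
```

Both the mean and the variance of the totals should come out close to 4 x 5 = 20, as expected for a Poisson variable with mean 20. If the plants are overdispersed relative to Poisson, the totals will be too, but you can no longer see where the extra variance comes from.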


*If you nevertheless choose to treat the values as continuous, note that the variance of the count values will change with the number of counts in a way that simple linear regression won't account for. Modeling the square root of the number of counts is an alternative that sometimes works well, but provides results that are harder to explain to others.
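A small simulation sketch of that footnote, assuming the counts really are Poisson (which is an idealization): the variance of raw counts grows with the mean, while the variance of their square roots stays roughly constant near 1/4. The three means below are arbitrary illustrative values.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(2)   # fixed seed so the run is reproducible
means = [10, 40, 160]    # arbitrary illustrative means
raw_vars, sqrt_vars = [], []

for lam in means:
    draws = [poisson_draw(rng, lam) for _ in range(5000)]
    raw_vars.append(statistics.variance(draws))
    sqrt_vars.append(statistics.variance([math.sqrt(x) for x in draws]))

for lam, rv, sv in zip(means, raw_vars, sqrt_vars):
    print(f"mean {lam:>3}: var(count) = {rv:7.2f}, var(sqrt(count)) = {sv:.3f}")
```

The stabilization is only approximate and gets worse for means near zero, which is one reason a count model is usually the cleaner choice.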
