As you have count data, it will be best to use a count model (Poisson, quasi-Poisson, negative binomial) rather than treating the values as continuous.* As the highest number of infected leaves is only about 12% of the typical total number of leaves, you won't have to worry about hitting an upper limit of count numbers. One limitation is that you only know the number of infected leaves per plant, while what's probably of more interest is the probability of a leaf getting infected. That would involve a binomial model and require knowing the total number of leaves per plant.
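As a sketch of the two options in R (the data frame `dat` and column names `infected`, `total`, `treatment`, `week`, and `location` are hypothetical placeholders, not from your data):

```r
# Count model: number of infected leaves per plant
fit_pois <- glm(infected ~ treatment + week + location,
                family = poisson, data = dat)

# Binomial model: probability that any one leaf is infected;
# this needs the total number of leaves per plant
fit_binom <- glm(cbind(infected, total - infected) ~ treatment + week + location,
                 family = binomial, data = dat)
```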
As you did not use the same plants repeatedly, think carefully whether a mixed model is really appropriate or helpful. With data over only 2 years, it doesn't make sense to treat the calendar year as a random effect. Whether a random effect for location
makes sense similarly can depend on the number of locations. It's often thought best to have on the order of 6 groups (locations, here) to treat them as a random effect; see this page and this page for some discussion. You don't seem to have that many locations.
You propose to use `week` as a random effect, but I suspect that (as typical for studies over time) you will have correlations in effects from week to week that a simple random intercept, as you propose for `week`, wouldn't adequately handle. It can be better to model `week` as a continuous variable but flexibly, with a regression spline or with smoothers in a generalized additive model.
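For example, either of the following would treat week flexibly as a continuous predictor (the data frame and column names are hypothetical, and the 4 degrees of freedom for the spline is an arbitrary starting choice):

```r
library(splines)
# Natural cubic spline for week within a standard GLM
fit_spline <- glm(infected ~ treatment + ns(week, df = 4),
                  family = poisson, data = dat)

# Or a penalized smoother in a generalized additive model
library(mgcv)
fit_gam <- gam(infected ~ treatment + s(week),
               family = poisson, data = dat)
```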
For your main question about whether to use the numbers for each plant or the sum over the 4 plants for each treatment/week/location, it's generally best to analyze data as close as possible to the original unit of measurement. That argues for using values for each plant separately. An average over 4 plants wouldn't be appropriate for a count model. Treating the plants separately will better document whether there is extra variance among plants that needs to be handled with a quasi-Poisson or negative binomial model.
You should recognize that the main sample-size issue in count-based data is the number of counts rather than the number of individuals from which they were obtained. For example, if you did a binomial model with data on one leaf at a time and a probability less than 50% of infection per leaf, then the limiting factor would be the total number of infected leaves. And if the data are truly Poisson with the same distribution for each plant having the same treatment/week/location, the sum over 4 such plants would also be Poisson.
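A quick simulation illustrates that last point; the per-plant rate of 3 is arbitrary, chosen just for illustration:

```r
set.seed(1)
# Sum of 4 independent Poisson(3) counts behaves as a single Poisson(12) count
x <- replicate(10000, sum(rpois(4, lambda = 3)))
mean(x)  # should be close to 12
var(x)   # should also be close to 12, since a Poisson mean equals its variance
```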
*If you nevertheless choose to treat the values as continuous, note that the variance of the count values will change with the number of counts in a way that simple linear regression won't account for. Modeling the square root of the number of counts is an alternative that sometimes works well, but provides results that are harder to explain to others.
First, the default ordering of a categorical predictor in R is alphabetical, so "August" is chosen here as the reference for the regression. You can re-order the months any way that you want. I'd recommend coding the months in calendar order in that categorical `month` predictor, as in your graph, to simplify your view of the modeled outcomes.
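For example (assuming, from what you describe, that your data cover roughly May through November; adjust the level names to whatever months you actually have):

```r
# Put the months in calendar order instead of R's default alphabetical order
dat$month <- factor(dat$month,
                    levels = c("May", "June", "July", "August",
                               "September", "October", "November"))

# Or keep the existing levels but change only the reference month
dat$month <- relevel(dat$month, ref = "July")
```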
Part of your problem is that there seem to be no infected leaves in either May or November. That will tend to give extreme coefficient estimates and associated standard errors, similar to what's seen with perfect separation in logistic regression.
If the month is either May or November, you seem to know (at least from these data) that there will be no infected leaves, just as for some predictor combinations in a logistic regression model you might seem to know that all cases had (or didn't have) the binary outcome. That probably also accounts for the extreme dispersion parameter in the quasi-Poisson fit. If there's a good reason not to expect any counts in those months (for example, May is too early in the year for any leaves and November too late), maybe just omit those months from your analysis and report why you did so.
You don't have any offset in your Poisson model to take into account the total number of leaves, which would be needed if you are interested in infection rates per leaf. Thus what you are modeling seems just to be the total number of infected leaves regardless of how many leaves there were. That is OK in principle, but it might not be what your readers are expecting at first glance when you talk about "incidence rates."
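If you do want per-leaf infection rates, an offset is easy to add; `total_leaves` here is a hypothetical column holding each plant's total leaf count:

```r
# The offset makes the model estimate infections per leaf rather than raw counts
fit_rate <- glm(infected ~ month + offset(log(total_leaves)),
                family = poisson, data = dat)
```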
With the default log link of the Poisson model, the (Intercept) of 4.27 should represent the log of the mean number of infected leaves in the reference month of August. The mean value itself is what's reported as the (Intercept) in the `tab_model()` output. If you analyzed per-plant data for about 12 plants in August and the graph of Raw Data is the sum of infected leaves over all the plants for each month, those values seem to represent that raw-data point (approx. 850 total infected leaves) pretty well.
What you see as a failure of the model to represent reality comes from a misunderstanding of the regression coefficients. With the default "treatment" coding of categorical predictors in R, the regression coefficients for the non-reference months are the differences in the log of the mean counts of each month from the August reference month.* Thus September has a higher log-mean count (positive coefficient) than does August, while the other months have lower values (negative coefficients) on that scale. The "incidence rate ratios" derived by `tab_model()` from the Poisson model are the ratios of counts relative to the counts of the August reference month. September's ratio is greater than 1; those of the other months are all less than 1.
Those values are all consistent with the raw data. Looking at the October outcome as an example, interpreted correctly as differences in logs (coefficients) or as ratios (`tab_model()`), those seem also to represent the raw data of about 540 counts over about 12 plants.
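You can see both scales directly from the fitted model object (here `fit` stands for your fitted Poisson model):

```r
coef(fit)          # log scale: Intercept = log mean count for August,
                   # other coefficients = log-mean differences from August
exp(coef(fit))     # ratio scale: Intercept = mean count in August,
                   # other values = incidence rate ratios versus August
exp(confint(fit))  # profile confidence intervals, back-transformed to ratios
```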
I hesitate to recommend this, but you could get results in the form that you want by omitting the intercept from the model (including a `-1` term in the formula). For a very simple model like this you will then get separate estimates of log-mean-counts for each month.
The reason I hesitate to recommend that is that you presumably will modify the model to take into account treatments and so forth. Without an intercept you will almost certainly end up with even more confusion in even a slightly more complicated model. I tend to use the default "treatment" coding in R and then use post-modeling tools like those in the R `emmeans` package to get the individual estimates I need.
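That approach looks something like this (again, `fit` and the `month` predictor are placeholders for your actual model):

```r
library(emmeans)
# Per-month estimated mean counts, back-transformed from the log scale
emmeans(fit, ~ month, type = "response")

# Pairwise month comparisons, with multiplicity adjustment
pairs(emmeans(fit, ~ month))
```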
*That's true for all regression models with "treatment" or "dummy" coding. If there's an intercept, then the Intercept is the estimated outcome at reference conditions and regression coefficients represent differences from the intercept.
Best Answer
The different numbers of cases under different treatments isn't a problem on its own. Although there's a long historical preference in ANOVA for having equal numbers of cases in each treatment, a proper regression model will handle different numbers of cases.
The bigger problem is that you can't distinguish certain types of effects. For example, you can't distinguish differences between locations X and Y from differences between the calendar year 2014 versus years 2015-2016.
One way to deal with this is to model all of the data together and then do post-modeling comparisons of specific situations. That uses all of the data to build the model, but restricts subsequent analysis to comparisons that make sense given the limits of your experimental design.
Code each combination of location/treatment/year in a single 9-level categorical predictor: treatments B, C, D at X in 2014; treatments A, B, C at Y in 2015; treatments A, B, C at Y in 2016. You presumably will be using a Poisson or negative binomial model for the count data, and you should also be controlling for week within the year by including that in your model (for example, with a flexible regression spline). You might consider an interaction between your 9-level categorical predictor and your modeling of week within the year.
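A sketch of that coding and model (data frame and column names are hypothetical; `drop = TRUE` discards the combinations that never occurred, leaving the 9 observed levels):

```r
# One 9-level factor for the observed location/treatment/year combinations
dat$grp <- interaction(dat$location, dat$treatment, dat$year, drop = TRUE)

library(splines)
# Poisson model with a flexible spline for week, interacted with the 9-level factor
fit <- glm(infected ~ grp * ns(week, df = 4),
           family = poisson, data = dat)
```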
After you have fit the model, don't pay much attention to the individual coefficient values. Instead, use post-modeling tools like those provided by the R `emmeans` package to evaluate combinations of coefficients that provide useful information. For example, did the overall 2015 and 2016 estimates of treatment effects at location Y differ? What is the difference between treatments B and C, combining both locations? Between D and B/C at location X? Between A and B/C at location Y?
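Mechanically, those questions become custom contrasts. The weight vectors below are purely illustrative; you must match them to the actual ordering shown by `levels()` of your 9-level factor before using them:

```r
library(emmeans)
emm <- emmeans(fit, ~ grp)  # one estimate per location/treatment/year combination

# Illustrative custom contrasts; reorder the weights to match levels(dat$grp)
contrast(emm, method = list(
  "Y 2015 vs Y 2016" = c(0, 0, 0, 1, 1, 1, -1, -1, -1) / 3
))
```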