Can the Kruskal-Wallis test be used for count data?

Tags: anova, hypothesis-testing, kruskal-wallis-test, r, regression

I collected positive count data, specifically the number of infected leaves per plant, as part of my experiment. I placed the plants in the field for a week under four different treatments, then brought them back to a controlled environment, counted the number of infected leaves, and discarded the plants. I repeated this process with fresh plants the following week, so this is not time-series data. Treatments were applied for a week at both locations, so the duration of each treatment was the same; plot size, treatment duration, and sample material were identical.

Now I'm interested in comparing the infection rates across different months, and I would like to determine whether there are any significant differences. My question is whether I can use the Kruskal-Wallis test for this purpose. I understand that the Kruskal-Wallis test is typically used when the response variable is ordinal or at least can be ranked, while my data consist of counts. Would it be appropriate to calculate the mean number of infected leaves per month and then compare the monthly means with the Kruskal-Wallis test?

Note: I tried Poisson regression, but R chooses one month as a reference group, so the results do not make sense to me. I tried different months as the reference group, but the results are still the same.

Best Answer

First, the default ordering of a categorical predictor in R is alphabetical, so "August" is chosen here as the reference level for the regression. You can re-order the months any way that you want. I'd recommend coding the months in calendar order in that categorical month predictor, as in your graph, to simplify your view of the modeled outcomes.
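For example (a sketch with a toy data frame; the column names `month` and `infected` are assumptions, not taken from your code):

```r
# Toy data standing in for the real data frame
dat <- data.frame(
  month    = rep(c("May", "June", "July", "August"), each = 3),
  infected = c(0, 0, 0, 5, 7, 6, 12, 10, 15, 70, 65, 80)
)

# Put the factor levels in calendar order instead of the alphabetical default
dat$month <- factor(dat$month, levels = c("May", "June", "July", "August"))
levels(dat$month)

# Or keep treatment coding but pick a different reference month
dat$month <- relevel(dat$month, ref = "July")
levels(dat$month)[1]   # "July" is now the reference level
```

`relevel()` only changes which level is the baseline for treatment contrasts; it doesn't change the fitted means, just how the coefficients are parameterized.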

Part of your problem is that there seem to be no infected leaves in either May or November. That will tend to give extreme coefficient estimates and associated standard errors, similar to what's seen with perfect separation in logistic regression.

If the month is either May or November, you seem to know (at least from these data) that there will be no infected leaves, just as for some predictor combinations in a logistic regression model you might seem to know that all cases had (or didn't have) the binary outcome. That probably also accounts for the extreme dispersion parameter in the quasi-Poisson fit. If there's a good reason not to expect any counts in those months (for example, May is too early in the year for any leaves and November too late), maybe just omit those months from your analysis and report why you did so.
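A small simulated sketch of that behaviour (made-up numbers, not your data): when one month has all-zero counts, its log-scale coefficient runs off toward minus infinity and the standard error explodes.

```r
set.seed(1)
counts <- c(rpois(12, 60), rep(0, 12))               # an August-like month vs. an all-zero month
month  <- factor(rep(c("August", "May"), each = 12)) # "August" is the alphabetical reference
fit    <- glm(counts ~ month, family = poisson)

summary(fit)$coefficients
# The 'monthMay' estimate is a very large negative number with an enormous
# standard error: the maximum-likelihood estimate for a zero-count cell is
# -Inf on the log scale, so the reported value is just where glm() stopped.
```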

You don't have any offset in your Poisson model to take into account the total number of leaves, which would be needed if you are interested in infection rates per leaf. Thus what you are modeling seems just to be the total number of infected leaves regardless of how many leaves there were. That is OK in principle, but it might not be what your readers are expecting at first glance when you talk about "incidence rates."
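A sketch of adding such an offset (simulated data; the column names `infected` and `total_leaves` are assumptions, not from your model):

```r
set.seed(2)
dat <- data.frame(
  month        = factor(rep(c("August", "September"), each = 12)),
  total_leaves = rpois(24, 200) + 50,
  infected     = rpois(24, 40)
)

# Poisson model for infections *per leaf*: log(total_leaves) enters
# with its coefficient fixed at 1 via offset()
fit_rate <- glm(infected ~ month + offset(log(total_leaves)),
                family = poisson, data = dat)

exp(coef(fit_rate))[1]   # estimated infection rate per leaf in the reference month
```

With a single categorical predictor and an offset, the fitted reference-month rate is exactly total infected leaves divided by total leaves in that month.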

With the default log link of the Poisson model, the (Intercept) of 4.27 should represent the log of the mean number of infected leaves in the reference month of August. The mean value itself is what's reported as the (Intercept) in the tab_model() output. If you analyzed per-plant data for about 12 plants in August and the graph of Raw Data is the sum of infected leaves over all the plants for each month, those values seem to represent that raw-data point (approx. 850 total infected leaves) pretty well.
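You can check that arithmetic directly; the 4.27 intercept and the roughly 850 total are taken from the question:

```r
exp(4.27)        # about 71.5 infected leaves per plant in August
12 * exp(4.27)   # about 858, close to the ~850 total in the raw-data plot
```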

What you see as a failure of the model to represent reality comes from a misunderstanding of the regression coefficients. With the default "treatment" coding of categorical predictors in R, the regression coefficients for the non-reference months are the differences in the log of the mean counts of each month from the August reference month.* Thus September has a higher log-mean count (positive coefficient) than does August, while the other months have lower values (negative coefficients) on that scale. The "incidence rate ratios" derived by tab_model() from the Poisson model are the ratios of counts relative to the counts of the August reference month. September's ratio is greater than 1; those of the other months are all less than 1.

Those values are all consistent with the raw data. Looking at the October outcome as an example, interpreted correctly as differences in logs (coefficients) or as ratios (tab_model()), those seem also to represent the raw data of about 540 counts over about 12 plants.
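A minimal simulation (made-up monthly means, not your data) showing that with treatment coding the exponentiated coefficients are exactly the ratios of each month's mean count to the reference month's mean:

```r
set.seed(3)
d <- data.frame(
  month    = factor(rep(c("August", "September", "October"), each = 12)),
  infected = c(rpois(12, 70), rpois(12, 90), rpois(12, 45))
)
fit <- glm(infected ~ month, family = poisson, data = d)

exp(coef(fit))
# (Intercept)    = mean count in August
# monthSeptember = September mean / August mean (the "incidence rate ratio")
# monthOctober   = October mean  / August mean

# These identities hold exactly at the maximum-likelihood fit:
mean_aug <- mean(d$infected[d$month == "August"])
all.equal(exp(coef(fit))[["(Intercept)"]], mean_aug)
```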

I hesitate to recommend this, but you could get results in the form that you want by omitting the intercept from the model (including a -1 term in the formula). For a very simple model like this you will then get separate estimates of log-mean counts for each month.
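For completeness, the no-intercept version (simulated data again, not yours):

```r
set.seed(4)
d <- data.frame(
  month    = factor(rep(c("August", "September", "October"), each = 12)),
  infected = c(rpois(12, 70), rpois(12, 90), rpois(12, 45))
)

# '- 1' drops the intercept, so each month gets its own log-mean-count
fit0 <- glm(infected ~ month - 1, family = poisson, data = d)
exp(coef(fit0))   # one estimated mean count per month, no reference group
```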

The reason I hesitate to recommend that is that you presumably will extend the model to take treatments and so forth into account. Without an intercept, you will almost certainly end up with even more confusion in even a slightly more complicated model. I tend to use the default "treatment" coding in R and then use post-modeling tools like those in the R emmeans package to get the individual estimates I need.


*That's true for all regression models with "treatment" or "dummy" coding. If there's an intercept, then the Intercept is the estimated outcome at reference conditions and regression coefficients represent differences from the intercept.