Solved – Statistical tests for count data with many zeros

statistical significance, t-test, wilcoxon-mann-whitney-test, zero inflation

I have to compare three groups of customers of a subscription box company.

Group A received treatment A.
Group B received treatment B.
Group C received no treatment.

We count the total number of boxes sold in each group (i.e. we are modelling count data).

I want to compare the mean number of boxes in groups A and B against the control (to see if the uplift is greater).

For this data ~2% of Group A got a box, ~2.1% of Group B got a box, and ~1.1% of Group C got a box.

What tests could we run to compare the means or the difference in uplift vs control for Group A and Group B?

I don't think a t-test or a Mann-Whitney test is appropriate here (though I am willing to be proven wrong!)

Best Answer

I gather that "uplift" is basically the number of boxes. If you want to compare the mean number of boxes across the groups, you should use an ANOVA. If the sample size is reasonably large, the sampling distribution of the mean is approximately normal, so the comparison of means remains valid even though the individual counts are far from normal.
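To make the large-sample argument concrete, here is a minimal sketch of a pairwise comparison of one treatment group against the control. It is not the ANOVA itself but a Welch-type z test on the difference in means, which relies on the same normal approximation. The data are simulated and purely hypothetical: the group size of 20,000 and the rule that each customer buys at most one box are assumptions, and only the purchase rates come from the question. The function names are made up for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance


def mean_difference_z_test(treated, control):
    """Large-sample (Welch-type) z test for a difference in mean counts."""
    diff = mean(treated) - mean(control)
    # Unpooled standard error, so no equal-variance assumption is needed.
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    z = diff / se
    # Two-sided p-value from the normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(0.975) * se
    return diff, (diff - half_width, diff + half_width), z, p_value


def simulate_group(n, buy_prob, rng):
    """Hypothetical data: each customer buys one box with probability buy_prob."""
    return [1 if rng.random() < buy_prob else 0 for _ in range(n)]


rng = random.Random(0)
group_a = simulate_group(20_000, 0.020, rng)  # assumed size, rate from the question
group_c = simulate_group(20_000, 0.011, rng)  # assumed size, rate from the question

diff, ci, z, p_value = mean_difference_z_test(group_a, group_c)
print(f"difference in means: {diff:.4f}, 95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"z = {z:.2f}, p = {p_value:.4g}")
```

With real data you would pass the per-customer box counts directly. Comparing both A and B against the control means two tests, so some multiplicity adjustment may be worth considering.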

Models for count data can be used as well, given that the outcome is a count. Poisson regression is the most common. Having many 0s does not mean the data are not Poisson; the rate could just be low. Quasi-Poisson and negative binomial models both relax the Poisson assumption that the variance equals the mean: quasi-Poisson makes the variance proportional to the mean, while the usual negative binomial parameterization lets the variance grow quadratically with the mean. In all cases the effect is interpreted as a relative rate of the number of boxes.
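When the only predictor is a group indicator, the Poisson regression estimate has a closed form: the rate ratio is the ratio of the two group means, and the standard error of its log is sqrt(1/total_treated + 1/total_control). The sketch below uses that shortcut instead of fitting a model with a statistics package, and adds the quasi-Poisson correction by rescaling the standard error with the Pearson dispersion. The counts are hypothetical and chosen only for illustration, the function name is made up, and the code assumes each group has at least one sale.

```python
import math
from statistics import NormalDist


def poisson_rate_ratio(treated, control, level=0.95):
    """Rate ratio (treated vs control) with Poisson and quasi-Poisson Wald intervals."""
    n_t, n_c = len(treated), len(control)
    total_t, total_c = sum(treated), sum(control)
    mean_t, mean_c = total_t / n_t, total_c / n_c

    # With a single group indicator, the Poisson MLE is the ratio of means.
    log_rr = math.log(mean_t / mean_c)
    # Model-based standard error of the log rate ratio (variance = mean).
    se_poisson = math.sqrt(1 / total_t + 1 / total_c)

    # Pearson dispersion; values well above 1 suggest overdispersion.
    pearson = sum((y - mean_t) ** 2 / mean_t for y in treated)
    pearson += sum((y - mean_c) ** 2 / mean_c for y in control)
    dispersion = pearson / (n_t + n_c - 2)  # two fitted parameters

    # Quasi-Poisson keeps the point estimate and rescales the standard error.
    se_quasi = se_poisson * math.sqrt(dispersion)

    z = NormalDist().inv_cdf(0.5 + level / 2)

    def interval(se):
        return math.exp(log_rr - z * se), math.exp(log_rr + z * se)

    return {
        "rate_ratio": math.exp(log_rr),
        "dispersion": dispersion,
        "ci_poisson": interval(se_poisson),
        "ci_quasi": interval(se_quasi),
    }


# Hypothetical per-customer box counts, not data from the question.
treated = [0] * 980 + [1] * 14 + [2] * 4 + [3] * 2
control = [0] * 989 + [1] * 9 + [2] * 2

result = poisson_rate_ratio(treated, control)
for name, value in result.items():
    print(name, value)
```

A rate ratio of 2 would read as "treated customers buy boxes at twice the rate of control customers". With covariates or more than two groups in one model, a GLM routine would be the natural tool rather than this closed form.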

A rank-based test will tell you nothing about the mean. Rank tests in general are not a panacea for violations of modeling assumptions (which is a separate issue from "having lots of zeroes"). Inferring differences in means does not require specifying an exact parametric model; robust or asymptotic statistics will give you valid inference about mean differences.
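A small constructed example may help show why ranks and means can disagree. The quantity a Mann-Whitney test responds to is P(X > Y) + 0.5 * P(X = Y), sometimes called the probability of superiority. In the hypothetical groups below (invented for illustration, including the idea of a customer buying 10 boxes), that probability favors group X while the mean favors group Y.

```python
from collections import Counter
from statistics import mean


def probability_of_superiority(x, y):
    """P(X > Y) + 0.5 * P(X = Y), computed exactly over all pairs."""
    count_x, count_y = Counter(x), Counter(y)
    wins = ties = 0
    # Iterate over distinct values rather than every pair of customers.
    for value_x, n_x in count_x.items():
        for value_y, n_y in count_y.items():
            if value_x > value_y:
                wins += n_x * n_y
            elif value_x == value_y:
                ties += n_x * n_y
    return (wins + 0.5 * ties) / (len(x) * len(y))


# Hypothetical groups: X has more buyers, Y has fewer buyers who buy far more.
group_x = [0] * 980 + [1] * 20
group_y = [0] * 990 + [10] * 10

print("mean of X:", mean(group_x))
print("mean of Y:", mean(group_y))
print("P(X > Y) + 0.5 * P(X = Y):", probability_of_superiority(group_x, group_y))
```

Here the mean of Y is five times the mean of X, yet the probability of superiority is slightly above 0.5 in favor of X, so a rank test and a comparison of means would point in opposite directions.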
