Solved – Are the data zero inflated

r, zero-inflation

I'm building a function/package. If the user's input data are zero inflated, the code will log-normalise them, process them, then reverse the log-normalisation and bias-correct them afterwards. If they're not zero inflated, it'll just process them.

Does anyone know of a test for zero inflation? Everyone knows what it IS, but the only similar question went semi-off-topic and just solved the guy's problem (How to test/prove data is zero inflated?).

Ideally I'd like something like this:
data_to_use <- if (iszeroinflated(data)) log1p(data) else data

But if that doesn't exist then whatever makes logical sense to be manually put in place of "iszeroinflated". What do we think? 30% of data are zeroes? 50? More? Less? And any consideration of the shape of the probability distribution other than the zeroes? One would expect lots of non-zero low numbers as well right? And only a few high ones?

Best Answer

Zero inflation is about the shape of the distribution. Therefore, if you want a formal test, you will have to specify the distribution for the non-zero part (Poisson, Negative Binomial, etc.). Then you can use a likelihood ratio test to see whether the zero-inflation parameters can be dropped from the model. This can be done in R.

In cruder terms, whether zero inflation can be detected depends not only on the proportion of zeros but also on the total number of observations. Say you assume a zero-inflated Poisson model and your data contain 50% zeros: you still won't be able to say with certainty that they are zero inflated if the total number of points is only 4. On the other hand, 10% zeros in 1000 observations can result in a positive test for zero inflation.
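To make the sample-size point concrete, here is a minimal base-R sketch of a likelihood ratio test of a plain Poisson against an intercept-only zero-inflated Poisson. The names `zip_negloglik`, `zip_lrt` and `is_zero_inflated` are hypothetical helpers written for this illustration, not functions from any package. The simulated data, the 0.05 cut-off and the halved chi-square p-value (a common adjustment because the null sits on the boundary of the parameter space) are all assumptions.

```r
# Negative log-likelihood of an intercept-only zero-inflated Poisson (ZIP).
# par[1] = logit of the extra-zero probability, par[2] = log of the Poisson mean.
zip_negloglik <- function(par, y) {
  p_zero <- plogis(par[1])
  lambda <- exp(par[2])
  ll <- ifelse(
    y == 0,
    log(p_zero + (1 - p_zero) * exp(-lambda)),
    log(1 - p_zero) + dpois(y, lambda, log = TRUE)
  )
  -sum(ll)
}

# Likelihood ratio test: plain Poisson (null) vs ZIP (alternative).
zip_lrt <- function(y) {
  stopifnot(any(y > 0))
  # The Poisson MLE of the mean is the sample mean.
  ll_pois <- sum(dpois(y, mean(y), log = TRUE))
  # Fit the ZIP numerically, starting from a small inflation probability.
  start <- c(qlogis(0.1), log(mean(y[y > 0])))
  fit <- optim(start, zip_negloglik, y = y)
  stat <- max(0, 2 * (-fit$value - ll_pois))
  # Assumption: halve the chi-square(1) tail because the null is on the boundary.
  p_value <- 0.5 * pchisq(stat, df = 1, lower.tail = FALSE)
  c(statistic = stat, p_value = p_value)
}

# Hypothetical stand-in for the "iszeroinflated" function asked about.
is_zero_inflated <- function(y, alpha = 0.05) {
  unname(zip_lrt(y)["p_value"] < alpha)
}

set.seed(1)
n <- 1000
# Roughly 10% zeros in 1000 observations (Poisson mean 5 plus 10% extra zeros).
y_big <- ifelse(runif(n) < 0.10, 0L, rpois(n, lambda = 5))
# 50% zeros, but only 4 observations.
y_small <- c(0, 0, 2, 3)

zip_lrt(y_big)
zip_lrt(y_small)
```

The verdict only holds under the assumed Poisson count distribution. Overdispersed data can look zero inflated against a Poisson without being so against a negative binomial.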

The zero-inflated property is associated with count data, so I haven't heard of a "zero-inflated normal". E.g. in this package:

cran.r-project.org/web/packages/pscl/pscl.pdf

they only consider Poisson, Negative Binomial and Geometric. What I would do is fit, say, a Poisson model where the zero and non-zero components contain only the intercept and then check if the intercept from the zero component has a significant p-value.
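A rough sketch of that intercept-only fit with `pscl::zeroinfl` might look like the following. The simulated data and the object names are assumptions for illustration. One caveat: the zero-component intercept is on the logit scale, so its Wald p-value tests an inflation probability of 0.5 rather than 0. The log-likelihood comparison against a plain Poisson fit is therefore included as a cross-check.

```r
library(pscl)

set.seed(42)
n <- 500
# Illustrative data: Poisson(4) counts with 20% extra zeros mixed in.
dat <- data.frame(y = ifelse(runif(n) < 0.2, 0L, rpois(n, lambda = 4)))

# Intercept-only zero-inflated Poisson: count part | zero part.
fit_zip <- zeroinfl(y ~ 1 | 1, data = dat, dist = "poisson")
# Plain Poisson with the same mean structure, for comparison.
fit_pois <- glm(y ~ 1, family = poisson, data = dat)

# Coefficient table of the zero component (estimate, SE, z value, p-value).
zero_coef <- summary(fit_zip)$coefficients$zero
zero_coef

# Back-transform the logit intercept to the estimated extra-zero probability.
p_extra <- plogis(zero_coef["(Intercept)", "Estimate"])
p_extra

# Gain in log-likelihood from adding the zero-inflation component.
ll_gain <- as.numeric(logLik(fit_zip)) - as.numeric(logLik(fit_pois))
ll_gain
```

Swapping `dist = "poisson"` for `dist = "negbin"` gives the negative binomial variant from the same package.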

P.S. I also managed to find a reference to the (log) normal zero-inflated distribution, but I don't know if it's obtainable for free:

"Analysis of repeated measures data with clumping at zero", Stat Methods Med Res, August 2002

http://smm.sagepub.com/content/11/4/341.full.pdf+html

Related Solutions

Solved – Trouble finding good model fit for count data with mixed effects – ZINB or something else

This post is four years old, but I wanted to follow up on what fsociety said in a comment. Diagnosing residuals in GLMMs is not straightforward, since standard residual plots can show non-normality, heteroscedasticity, etc., even if the model is correctly specified. There is an R package, DHARMa, specifically suited for diagnosing this type of model.

The package is based on a simulation approach to generate scaled residuals from fitted generalized linear mixed models and generates different easily interpretable diagnostic plots. Here is a small example with the data from the original post and the first fitted model (m1):

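library(DHARMa)
# m1 is the first fitted model from the original post
sim_residuals <- simulateResiduals(m1, n = 1000)
# plot() replaces plotSimulatedResiduals(), which is deprecated in newer DHARMa versions
plot(sim_residuals)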

The plot on the left shows a QQ plot of the scaled residuals to detect deviations from the expected distribution, and the plot on the right represents residuals vs predicted values while performing quantile regression to detect deviations from uniformity (red lines should be horizontal and at 0.25, 0.50 and 0.75).

Additionally, the package also has specific functions for testing for over/under-dispersion and zero inflation, among others:

testOverdispersionParametric(m1)

Chisq test for overdispersion in GLMMs

data:  poisson
dispersion = 0.18926, pearSS = 11.35600, rdf = 60.00000, p-value = 1
alternative hypothesis: true dispersion greater 1

testZeroInflation(sim_residuals)

DHARMa zero-inflation test via comparison to expected zeros with 
simulation under H0 = fitted model


data:  sim_residuals
ratioObsExp = 0.98894, p-value = 0.502
alternative hypothesis: more
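Since `m1` from the original post is not reproduced here, the following is a self-contained sketch of the same zero-inflation check on simulated data. The data, the Poisson GLM and all object names are assumptions made for illustration, not the original poster's model.

```r
library(DHARMa)

set.seed(123)
n <- 500
x <- runif(n)
# Illustrative counts with roughly 20% structural (extra) zeros.
y <- ifelse(runif(n) < 0.2, 0L, rpois(n, lambda = exp(0.5 + x)))
dat <- data.frame(x = x, y = y)

# A plain Poisson GLM that ignores the extra zeros.
fit_pois <- glm(y ~ x, family = poisson, data = dat)

# Simulate scaled residuals and compare observed zeros to those expected
# under the fitted model.
sim <- simulateResiduals(fit_pois, n = 1000)
zi_test <- testZeroInflation(sim)
zi_test$p.value
```

A small p-value here suggests the fitted model produces fewer zeros than the data contain, which is the situation a zero-inflated model is meant to handle.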
