Are the data zero inflated?

r, zero inflation

I'm building a function/package. If the user's input data are zero inflated, the code will log-normalise them, process them, then reverse the log-normalisation and bias-correct them afterwards. If they're not zero inflated, it will just process them.

Does anyone know of a test for zero inflation? Everyone knows what it IS, but the only similar question went semi-off-topic and just solved the guy's problem (How to test/prove data is zero inflated?).

Ideally I'd like something like this:
data_to_use <- if (iszeroinflated(data)) log1p(data) else data

But if that doesn't exist, then whatever makes logical sense to put manually in place of "iszeroinflated". What do we think? 30% of data are zeroes? 50%? More? Less? And is there any consideration of the shape of the probability distribution other than the zeroes? One would expect lots of non-zero low numbers as well, right? And only a few high ones?

Best Answer

Zero inflation is about the shape of the distribution. Therefore, if you want a formal test, you have to specify the distribution for the non-zero part (Poisson, negative binomial, etc.). You can then use a likelihood ratio test to see whether the zero-inflation parameters can be dropped from the model. This can be done in R.

In cruder terms, whether zero inflation can be detected depends not only on the proportion of zeros but also on the total number of observations. Say you assume a zero-inflated Poisson model and 50% of your data are zeros: you still can't say with any certainty that the data are zero inflated if the total number of points is only 4. On the other hand, 10% zeros in 1000 observations can give a positive test for zero inflation, for example when the mean of the counts is large enough that a plain Poisson would produce far fewer zeros.
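To make the sample-size point concrete, here is a minimal base-R sketch. It is only a crude check, not a formal test: the helper name `excess_zero_pvalue` and both toy data sets are made up for illustration, and it treats the Poisson mean as known (equal to the sample mean) rather than estimated.

```r
# Crude check: are there more zeros than a plain Poisson with the same mean
# would be expected to produce? (Hypothetical helper, not from any package.)
excess_zero_pvalue <- function(x) {
  n <- length(x)
  p0 <- exp(-mean(x))       # P(Y = 0) under Poisson(lambda = mean(x))
  n_zero <- sum(x == 0)     # observed number of zeros
  # One-sided binomial tail: P(at least n_zero zeros out of n draws)
  pbinom(n_zero - 1, size = n, prob = p0, lower.tail = FALSE)
}

# Toy data (assumed for illustration only)
small <- c(0, 0, 1, 2)                          # 50% zeros, n = 4
large <- c(rep(0, 100), rep(c(2, 3, 4), 300))   # 10% zeros, n = 1000

p_small <- excess_zero_pvalue(small)  # well above 0.05: no evidence
p_large <- excess_zero_pvalue(large)  # far below 0.05: excess zeros
```

Because excess zeros also pull the sample mean down, this check tends to understate the excess, so it is best read as a rough screen before fitting a proper zero-inflated model.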

The zero-inflated property is usually associated with count data, so I haven't heard of a "zero-inflated normal". E.g. in this package:

cran.r-project.org/web/packages/pscl/pscl.pdf

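they only consider Poisson, negative binomial and geometric. What I would do is fit, say, a Poisson model where the zero and non-zero components contain only an intercept, and then check whether the intercept of the zero component has a significant p-value. Bear in mind that with the default logit link this intercept is on the log-odds scale, so its Wald test compares the zero-inflation probability with 0.5 rather than with 0; comparing the fit against a plain Poisson model is a more direct check.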
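The intercept-only fit described above could look roughly like the following sketch. It assumes the pscl package is installed and uses simulated data (30% structural zeros, Poisson mean 3), which are my own illustrative choices. Halving the chi-square p-value is a common adjustment for a null value on the boundary of the parameter space, and is an assumption of this sketch rather than something pscl does for you.

```r
library(pscl)  # provides zeroinfl()

set.seed(1)
n <- 500
# Simulated zero-inflated Poisson data (illustrative values only):
# 30% structural zeros, Poisson(3) counts otherwise
structural_zero <- rbinom(n, size = 1, prob = 0.3)
y <- ifelse(structural_zero == 1, 0, rpois(n, lambda = 3))
d <- data.frame(y = y)

# Plain Poisson vs zero-inflated Poisson, intercepts only in both parts
pois_fit <- glm(y ~ 1, family = poisson, data = d)
zip_fit  <- zeroinfl(y ~ 1 | 1, data = d, dist = "poisson")

# Wald table for the count and zero components
print(summary(zip_fit))

# The zero-component intercept is on the logit scale; back-transform it
# to the estimated probability of a structural zero
pi_hat <- plogis(coef(zip_fit, model = "zero")[["(Intercept)"]])

# Likelihood ratio statistic for dropping the zero-inflation component
lr_stat <- as.numeric(2 * (logLik(zip_fit) - logLik(pois_fit)))
# Boundary null: halve the usual chi-square(1) p-value (assumption)
lr_pvalue <- 0.5 * pchisq(lr_stat, df = 1, lower.tail = FALSE)
```

A small `lr_pvalue` together with a `pi_hat` clearly above zero suggests that the zero-inflation component is doing real work; pscl also offers `vuong()` for comparing the two fits as non-nested models.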

P.S. I also managed to find a reference to the (log) normal zero-inflated distribution, but I don't know if it's obtainable for free:

"Analysis of repeated measures data with clumping at zero", Stat Methods Med Res, August 2002

http://smm.sagepub.com/content/11/4/341.full.pdf+html
