Generalized Linear Models – How to Handle Zero-Inflated Data with Negative Values

generalized linear model, negative-binomial-distribution, zero inflation

Participants in a survey answer two questions similar to this:

  • What is your estimate of how a woman performs on average in this test (from 0 to 100)?

  • What is your estimate of how a man performs on average in this test (from 0 to 100)?

The dependent variable (DV) is the difference between these two answers. The DV is therefore highly zero-inflated, because the majority of respondents do not assume that there will be a difference. The DV does vary, however, with the treatment and with the age, gender and education of the respondent.

The models that fit zero-inflated data well mostly deal with count data, so they assume that the DV is non-negative. See, for instance, here (https://fukamilab.github.io/BIO202/04-C-zero-data.html) and here (https://stats.oarc.ucla.edu/r/dae/zinb/).

How should I deal with it?

Here is the distribution of DV across four treatments, for illustration.

[Figure: distribution of the DV across the four treatments]

Best Answer

The response variable in a "real" zero-inflated model does not have to be a non-negative count; it is just that, when this condition does not hold, we usually call such a model a "mixture model" rather than a "zero-inflated model". If you haven't come across it, the CV.SE thread "What is the difference between zero-inflated and hurdle models?" is great. It highlights how a zero-inflated model works: its likelihood is a mixture of a point mass at zero (chosen by a Bernoulli variable) and a count distribution such as the Poisson, which is zero-truncated in the hurdle variant. In this use case, instead of using a distribution for counts, we will use a more permissive distribution like a Gaussian or maybe a $t$-distribution (to account for potential outliers).
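To make the mixture concrete, here is one way the log-likelihood could be written for a Gaussian component. The symbols are my own notation, not taken from the linked thread: $\pi_i$ is the probability that respondent $i$ answers exactly zero, $z_i$ and $x_i$ are covariate vectors (treatment, age, gender, education), and $\gamma$, $\beta$, $\sigma$ are the parameters to estimate.

$$
\ell(\gamma, \beta, \sigma) = \sum_{i:\, y_i = 0} \log \pi_i \;+\; \sum_{i:\, y_i \neq 0} \Big[ \log(1 - \pi_i) + \log \mathcal{N}\big(y_i \mid \mu_i, \sigma^2\big) \Big],
\qquad \operatorname{logit}(\pi_i) = z_i^\top \gamma, \quad \mu_i = x_i^\top \beta
$$

Because a continuous component produces an exact zero with probability zero, every observed zero is attributed to the point mass. Under that assumption the zero-inflated and hurdle formulations coincide, and the two sums can be maximised separately. Treating the answers as continuous is itself an approximation if respondents only give whole numbers.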

Implementing a mixture model is not always trivial, but R has a widely used package, flexmix, that we can utilise. Going further, we can even be more aggressive, go "full Bayesian", and use Stan or any other probabilistic modelling framework we wish (e.g. PyMC or Turing.jl); the Stan documentation has examples of defining mixture models (and even a zero-inflated example application), and so does the PyMC documentation.
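Before reaching for a full framework, it can help to see the idea in its simplest form. The following is a minimal sketch, not a substitute for flexmix or Stan: it assumes a point mass at zero mixed with a single Gaussian, fitted separately per treatment with no other covariates, in which case the maximum-likelihood estimates have closed forms. The function names and the toy data are hypothetical and only for illustration.

```python
import math
from statistics import fmean, pstdev


def fit_zero_inflated_gaussian(y):
    """Closed-form MLE for a point mass at zero mixed with one Gaussian.

    Returns (pi_zero, mu, sigma): the share of exact zeros, and the mean and
    maximum-likelihood (population) standard deviation of the non-zeros.
    """
    nonzero = [v for v in y if v != 0]
    if len(set(nonzero)) < 2:
        raise ValueError("need at least two distinct non-zero responses")
    pi_zero = 1 - len(nonzero) / len(y)
    return pi_zero, fmean(nonzero), pstdev(nonzero)


def log_likelihood(y, pi_zero, mu, sigma):
    """Log-likelihood of y under the zero-inflated Gaussian mixture."""
    total = 0.0
    for v in y:
        if v == 0:
            # Exact zeros are attributed to the point mass.
            total += math.log(pi_zero)
        else:
            z = (v - mu) / sigma
            total += (math.log(1 - pi_zero) - math.log(sigma)
                      - 0.5 * math.log(2 * math.pi) - 0.5 * z * z)
    return total


# Hypothetical toy data: woman-minus-man estimates for two made-up treatments.
data = {
    "control": [0, 0, 0, 0, 0, 0, 5, -10, 10, -5],
    "treated": [0, 0, 0, 0, -20, -10, -15, 5, -5, -15],
}
fits = {name: fit_zero_inflated_gaussian(y) for name, y in data.items()}
for name, (pi_zero, mu, sigma) in fits.items():
    print(f"{name}: P(zero)={pi_zero:.2f}, mean of non-zeros={mu:.2f}, sd={sigma:.2f}")
```

With covariates such as age, gender and education, the share of zeros would become a logistic regression and the Gaussian mean a linear regression; that is the point at which a package or a probabilistic modelling framework pays off, especially if the Gaussian is swapped for a $t$-distribution.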
