How to model non-negative zero-inflated continuous data

Tags: regression, tobit-regression, tweedie-distribution, zero-inflation

I'm currently trying to apply a linear model (family = gaussian) to an indicator of biodiversity that cannot take values lower than zero, is zero-inflated and is continuous. Values range from 0 to a little over 0.25. As a consequence, there is quite an obvious pattern in the residuals of the model that I haven't managed to get rid of:
[Figure: residual plot from the fitted model]

Does anyone have any ideas on how to solve this?

Best Answer

There are a variety of solutions to the case of zero-inflated (semi-)continuous distributions:

  • Tobit regression: assumes that the data come from a single underlying Normal distribution, but that negative values are censored and stacked on zero (e.g. the censReg package). Here is a good book about the Tobit model; see chapters 1 and 5.
  • See this answer for other censored-Gaussian alternatives.
  • Hurdle or "two-stage" model: use a binomial model to predict whether a value is 0 or >0, then use a linear model (or Gamma, truncated Normal, or log-Normal) for the observed non-zero values. Typically you need to roll your own by fitting the two models separately. Combined versions that fit the zero component and the non-zero component at the same time exist for count distributions such as Poisson (e.g. glmmTMB, pscl); glmmTMB will also fit 'zero-inflated'/hurdle models for Beta or Gamma responses.
  • Tweedie distributions: distributions in the exponential (dispersion) family that, for a given range of the power parameter ($1<p<2$), have a point mass at zero and a skewed positive distribution for $x>0$ (e.g. the tweedie, cplm, and glmmTMB packages)
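
To make the "roll your own" hurdle option from the list above concrete, here is a minimal sketch in Python using only NumPy (the packages named above are for R; this is just an illustration of the idea, not a substitute for them). It assumes a single predictor, a logistic model for zero vs. non-zero, and a log-Normal model for the positive values. The function names (fit_logistic, fit_hurdle, predict_mean) and the simulated data are hypothetical and only there to make the example runnable.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (z - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def fit_hurdle(x, y):
    """Stage 1: model P(y > 0). Stage 2: OLS on log(y) for the positive values only."""
    X = np.column_stack([np.ones_like(x), x])
    positive = y > 0
    beta_zero = fit_logistic(X, positive.astype(float))
    X_pos = X[positive]
    log_y = np.log(y[positive])
    beta_pos, *_ = np.linalg.lstsq(X_pos, log_y, rcond=None)
    resid = log_y - X_pos @ beta_pos
    sigma2 = resid @ resid / (len(log_y) - X.shape[1])
    return beta_zero, beta_pos, sigma2

def predict_mean(x, beta_zero, beta_pos, sigma2):
    """E[y] = P(y > 0) * E[y | y > 0], using the log-Normal mean exp(mu + sigma^2 / 2)."""
    X = np.column_stack([np.ones_like(x), x])
    p_pos = 1.0 / (1.0 + np.exp(-X @ beta_zero))
    mean_pos = np.exp(X @ beta_pos + sigma2 / 2.0)
    return p_pos * mean_pos

# Simulated data (an assumption, purely for illustration): many exact zeros,
# positive values of roughly the magnitude described in the question.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
is_pos = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))
y = np.where(is_pos, np.exp(-3.0 + 0.3 * x + 0.4 * rng.normal(size=n)), 0.0)

beta_zero, beta_pos, sigma2 = fit_hurdle(x, y)
pred = predict_mean(x, beta_zero, beta_pos, sigma2)
print("zero/non-zero coefficients:", beta_zero)
print("positive-part coefficients (log scale):", beta_pos)
```

Because the two stages share no parameters, fitting them separately gives the same estimates as fitting them jointly; the log-Normal stage could be swapped for a Gamma or truncated-Normal model without changing the overall structure.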

Or, if your data structure is simple enough, you could just use linear models and use permutation tests or some other robust approach to make sure that your inference isn't being messed up by the interesting distribution of the data.
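
As a rough sketch of the permutation-test idea, assuming the simplest case of one predictor and exchangeable observations under the null of no association, the following Python (NumPy only) shuffles the response to build a null distribution for the ordinary least-squares slope. The function name permutation_slope_test and the simulated data are hypothetical; designs with several predictors or grouping structure need a more careful permutation scheme.

```python
import numpy as np

def permutation_slope_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the OLS slope of y on x."""
    rng = np.random.default_rng(seed)

    def slope(xv, yv):
        xc = xv - xv.mean()
        return (xc @ (yv - yv.mean())) / (xc @ xc)

    observed = slope(x, y)
    # Shuffling y breaks any association with x while keeping its
    # zero-heavy marginal distribution exactly as observed.
    perm = np.array([slope(x, rng.permutation(y)) for _ in range(n_perm)])
    # The +1 counts the observed statistic itself, so the p-value is never 0.
    p_value = (1 + np.sum(np.abs(perm) >= abs(observed))) / (n_perm + 1)
    return observed, p_value

# Simulated non-negative response with a pile-up at zero (illustrative assumption).
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 1.0, n)
y = np.clip(0.02 + 0.15 * x + rng.normal(0.0, 0.05, n), 0.0, None)

observed, p_value = permutation_slope_test(x, y)
print("observed slope:", observed, "permutation p-value:", p_value)
```

The test makes no Normality assumption about the residuals, which is the point here: the p-value stays valid even though the response has a spike at zero.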

There are R packages/solutions available for most of these cases.

There are other questions on SE about zero-inflated (semi)continuous data (e.g. here, here, and here), but they don't seem to offer a clear general answer ...

See also Min & Agresti, 2002, Modeling Nonnegative Data with Clumping at Zero: A Survey for an overview.
