Fitting continuous data with zeros to a discrete distribution

Tags: continuous-data, count-data, distributions

I have data on the abundance of a particular organism across a sampling area. However, instead of counts, I have the estimated biomass of the organism at each sampling location (that is, the estimated total weight of all the organisms at that location, but not the actual number of organisms). I know that a Poisson or negative binomial (NB) model is often appropriate for count data, and the NB seems particularly appropriate for these data (the variance is large relative to the mean, and the species is known to be spatially aggregated). Can continuous data that is really an index of a discrete variable be modeled using a discrete distribution? I've found one or two papers where biomass data were fit to an NB distribution, but they are light on details and statistical justification.

Edit: added plot of the ECDF. [Figure: ECDF]

Best Answer

If the counts are all likely to be large, the main potential issue I see here is the variance function, since you don't have anything that scales the biomass to an actual count. It's like having a noisy scaled count without knowing the scaling factor. That may not be such an issue with the negative binomial as it is with the Poisson, though.
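To make the variance-function point concrete, here is a short worked sketch. It assumes, purely for illustration, that biomass is an exact rescaling of the count, $Y = cN$, where $N$ is the unobserved count with mean $\mu$, $c$ is an unknown weight per organism, and $\mu_Y = c\mu$ is the mean biomass; the symbols $c$, $k$ and $\mu_Y$ are notation introduced here, not taken from the question, and variation in individual weights is ignored.

$$
\text{Poisson: } \operatorname{Var}(N) = \mu \;\Longrightarrow\; \operatorname{Var}(Y) = c^2\mu = c\,\mu_Y
$$

$$
\text{NB (size } k\text{): } \operatorname{Var}(N) = \mu + \frac{\mu^2}{k} \;\Longrightarrow\; \operatorname{Var}(Y) = c^2\mu + \frac{c^2\mu^2}{k} = c\,\mu_Y + \frac{\mu_Y^2}{k}
$$

Under this assumption the unknown scale $c$ enters only the linear term. For the Poisson that term is the whole variance, so the variance function is wrong by the unknown factor $c$. For the NB with large counts the quadratic term $\mu_Y^2/k$ dominates, and that term does not depend on $c$.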

If you have some atoms of probability but the data are otherwise continuous you have a mixed distribution (a mixture of continuous and discrete); when the only atom is at zero, it's sometimes called a zero-inflated continuous distribution.
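As a sketch of what such a mixed distribution looks like when the only atom is at zero, write $\pi$ for the probability of a zero and $G$ for the distribution function of a continuous distribution on $(0, \infty)$, such as a gamma or lognormal. Both symbols are notation introduced here for illustration.

$$
P(Y = 0) = \pi, \qquad P(Y \le y) = \pi + (1 - \pi)\,G(y) \quad \text{for } y \ge 0 .
$$

The distribution function jumps by $\pi$ at zero and is continuous afterwards, which is the pattern one would expect to see in an ECDF of data like these.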

Zero-inflated gamma and zero-inflated lognormal distributions are commonly used; either might suit your case. Typical models include zero-inflated and hurdle models (yes, the term "zero-inflated" is overloaded). These are often applied to discrete data (e.g. for otherwise-Poisson data you have Poisson hurdle and zero-inflated Poisson, or ZIP, models), where the two kinds of model differ in how they treat zeros. The distinction is less clearly drawn for continuous models. If I used different variables to model the zeros than I used for the continuous part, I'd tend to call it a hurdle model rather than zero-inflated. If I used the same form of linear predictor (but with different betas), or if I had a constant probability of zero, I'd probably call it a zero-inflated model. However, I'm not an expert on such models, so you may be better off following other people's way of dividing up models for continuous zero-inflated data.
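For a feel of how the two parts fit together, here is a minimal sketch of the simplest case mentioned above: a constant probability of zero plus a gamma for the positive values, with no covariates. It is an illustration under stated assumptions rather than a recommended procedure. The function names (`fit_gamma_hurdle`, `hurdle_mean`) and the simulated data are hypothetical, and the gamma parameters are estimated by method of moments to keep the example dependency-free; a real analysis would more likely use maximum likelihood, with a logistic regression for the zeros and a gamma GLM for the positive part.

```python
import random
import statistics


def fit_gamma_hurdle(y):
    """Fit a covariate-free zero/gamma two-part model.

    Returns (p_zero, shape, scale): the observed proportion of zeros and
    method-of-moments gamma parameters for the strictly positive values.
    """
    if any(v < 0 for v in y):
        raise ValueError("biomass values must be non-negative")
    positives = [v for v in y if v > 0]
    if len(positives) < 2:
        raise ValueError("need at least two positive observations")

    # Part 1: probability of a zero (a constant here; no covariates).
    p_zero = 1 - len(positives) / len(y)

    # Part 2: gamma for the positive values, matched on mean and variance.
    mean = statistics.fmean(positives)
    var = statistics.variance(positives)
    shape = mean ** 2 / var   # gamma mean     = shape * scale
    scale = var / mean        # gamma variance = shape * scale**2
    return p_zero, shape, scale


def hurdle_mean(p_zero, shape, scale):
    """Overall expected value: P(positive) times the gamma mean."""
    return (1 - p_zero) * shape * scale


# Simulated stand-in for biomass data (all values below are assumptions).
rng = random.Random(1)
true_p_zero, true_shape, true_scale = 0.3, 2.0, 5.0
biomass = [
    0.0 if rng.random() < true_p_zero
    else rng.gammavariate(true_shape, true_scale)
    for _ in range(20000)
]

p_zero, shape, scale = fit_gamma_hurdle(biomass)
print(f"P(zero) = {p_zero:.3f}, shape = {shape:.2f}, scale = {scale:.2f}")
print(f"overall mean = {hurdle_mean(p_zero, shape, scale):.2f}")
```

Because the zeros and the positive values are handled separately, the two parts can be estimated independently of each other, which is what makes this kind of two-part model easy to fit.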

There are some posts on our site relating to zero-inflated gamma models and other zero-inflated distributions, and on continuous zero-inflated and/or hurdle models.

On this page, Sean Anderson talks about gamma hurdle models and specifically mentions their use for modelling biomass.


Portion of older answer given under the original post (which stated the distribution was continuous):

I'd be inclined to model it as a gamma; it's continuous, and it arguably has roughly similar properties to the negative binomial.
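One way to see the similarity is through the variance functions. Taking a gamma with mean $\mu$ and shape $k$, and a negative binomial with mean $\mu$ and size $k$ (the parameterisation is chosen here for illustration):

$$
\operatorname{Var}_{\text{gamma}}(Y) = \frac{\mu^2}{k}, \qquad \operatorname{Var}_{\text{NB}}(Y) = \mu + \frac{\mu^2}{k}, \qquad \frac{\operatorname{Var}_{\text{NB}}(Y)}{\operatorname{Var}_{\text{gamma}}(Y)} = 1 + \frac{k}{\mu} .
$$

The ratio approaches 1 as $\mu$ grows relative to $k$, so for large means the two have nearly the same mean-variance relationship. The negative binomial can also be derived as a Poisson whose rate is itself gamma distributed, which is another sense in which the two are related.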

Is there a particular reason you need the negative binomial?
