Zero-Inflated Gaussian – Understanding the Zero-Inflated Gaussian for Weight Gains Below Zero Recorded as 0

negative-binomial-distribution, regression, zero-inflation

I'm aware of the general idea behind zero-inflated models, and have used the zero-inflated Poisson and negative binomial. However, the data I currently have are in a slightly different format that makes me think these may not be good choices.

The dependent variable is weight gain. However, the data is truncated at 0, so anyone who didn't gain any weight was just marked as "no weight gain." Everyone else has the weight gained recorded in pounds. I could treat this as negative binomial counts, since the weight gain is rounded to integer lbs, but that seems incorrect because the variable is fundamentally continuous.

Is a zero-inflated Gaussian a good fit, or even possible? Does anyone know of an implementation of this type of model in R?

Best Answer

I think the model is more appropriately a left-censored Gaussian, since the process you describe discards information below some value (here the censoring point is known to be 0, which is simpler than the case of an unknown censoring value). In other words, there's some real quantity which could (hypothetically) be measured, but that quantity is not recorded. We need a modeling tool that reflects that there is some true, non-censored value, but that this value is not available to us.
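To make the censoring idea concrete, this is the likelihood of a standard left-censored (Tobit) Gaussian regression with the censoring point at 0; the covariates $x_i$, coefficients $\beta$, and error scale $\sigma$ are generic notation, not anything from OP's post. Each latent weight change $y_i^* \sim \mathcal{N}(x_i^\top\beta, \sigma^2)$ is observed exactly when positive and recorded as 0 ("no weight gain") otherwise:

$$
L(\beta,\sigma) \;=\; \prod_{i:\,y_i>0} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i^\top\beta}{\sigma}\right) \;\times\; \prod_{i:\,y_i=0} \Phi\!\left(\frac{-x_i^\top\beta}{\sigma}\right),
$$

where $\phi$ and $\Phi$ are the standard normal density and CDF. The censored observations contribute a probability mass rather than a density.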

One resource I happen to have on my bookshelf is Gelman et al., Bayesian Data Analysis (3rd edition). Censoring and truncation models are discussed starting on page 224. The authors write

Suppose an object is weighed 100 times on an electronic scale with a known measurement distribution $\mathcal{N}(\theta,1^2)$, where $\theta$ is the true weight of the object....

[T]he scale has an upper limit of 200 kg for reports: all values above 200kg are reported as "too heavy." The complete data are still $\mathcal{N}(\theta,1^2)$, but the observed data are censored; if we observe "too heavy," we know that it corresponds to a weighing with a reading above 200.

This is very similar to the problem stated by OP, except that the example is censored above 200 kg instead of below 0, and each object is weighed repeatedly with some instrument error.
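As a quick illustration of what such data look like, here is the scale example simulated in R (the "true" weight of 199.3 kg is made up for the sake of the example):

```r
## Simulate the BDA scale example: 100 weighings with N(theta, 1^2) error,
## censored above 200 kg ("too heavy").
set.seed(1)
theta    <- 199.3                               # hypothetical true weight (made up)
reading  <- rnorm(100, mean = theta, sd = 1)    # complete (uncensored) data
observed <- ifelse(reading > 200, NA, reading)  # "too heavy" readings recorded as NA
table(too_heavy = is.na(observed))              # how many weighings were censored
```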

One R package that seems relevant is censReg.

Arne Henningsen. "Estimating Censored Regression Models in R using the censReg Package"

We demonstrate how censored regression models (including standard Tobit models) can be estimated in R using the add-on package censReg. This package provides not only the usual maximum likelihood (ML) procedure for cross-sectional data but also the random-effects maximum likelihood procedure for panel data using Gauss-Hermite quadrature.
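Based on the package vignette, a minimal call would look something like the sketch below (untested; the data frame is simulated, and `y`/`x` stand in for your weight-gain outcome and whatever covariates you have):

```r
# install.packages("censReg")   # if needed
library(censReg)

## Simulated stand-in for the real data: latent weight change, recorded as 0 when <= 0
set.seed(42)
dat   <- data.frame(x = rnorm(500))                      # hypothetical covariate
dat$y <- pmax(2 + 1.5 * dat$x + rnorm(500, sd = 3), 0)   # censor at 0

## Left-censored (Tobit) Gaussian regression with the censoring point at 0
fit <- censReg(y ~ x, left = 0, data = dat)
summary(fit)
```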

I haven't used it, so I can't vouch for its quality or utility for this problem. There are probably lots of other options. The approach taken in Bayesian Data Analysis is to code up your own model, either in base R or in Stan. This gives the greatest degree of flexibility, at the cost of having to do the coding yourself.
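For the "code it yourself" route, here is a minimal base-R sketch (no Stan) that maximizes the left-censored Gaussian log-likelihood directly with `optim()`, using the same simulated data as in the `censReg` sketch above:

```r
## Same simulated data as in the censReg sketch
set.seed(42)
dat   <- data.frame(x = rnorm(500))
dat$y <- pmax(2 + 1.5 * dat$x + rnorm(500, sd = 3), 0)

## Negative log-likelihood of a Gaussian regression left-censored at 0
negloglik <- function(par, y, x) {
  beta0 <- par[1]; beta1 <- par[2]; sigma <- exp(par[3])   # log-sigma keeps sigma > 0
  mu   <- beta0 + beta1 * x
  cens <- y <= 0                                           # "no weight gain" records
  ll_obs  <- dnorm(y[!cens], mean = mu[!cens], sd = sigma, log = TRUE)
  ll_cens <- pnorm(0, mean = mu[cens], sd = sigma, log.p = TRUE)  # P(latent <= 0)
  -(sum(ll_obs) + sum(ll_cens))
}

fit <- optim(c(0, 0, 0), negloglik, y = dat$y, x = dat$x, hessian = TRUE)
c(intercept = fit$par[1], slope = fit$par[2], sigma = exp(fit$par[3]))
```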
