Zero-Inflated Gaussian – Understanding the Zero-Inflated Gaussian for Weight Gains Below Zero Recorded as 0

negative-binomial-distribution, regression, zero-inflation

I'm aware of the general idea behind zero-inflated models, and have used the zero-inflated Poisson and negative binomial. However, the data I currently have are in a slightly different format that makes me think these may not be good choices.

The dependent variable is weight gain. However, the data is truncated at 0, so anyone who didn't gain any weight was just marked as "no weight gain." Everyone else has the weight gained recorded in pounds. I could treat this as negative binomial counts, since the weight gain is rounded to integer lbs, but that seems incorrect because the variable is fundamentally continuous.

Is a zero-inflated Gaussian a good fit, or even possible? Does anyone know of an implementation of this type of model in R?

Best Answer

I think the model is more appropriately a left-censored Gaussian, since the process you describe discards information below some value (here the censoring point is known to be 0, which is simpler than the case of an unknown censoring value). In other words, there's some real quantity which could (hypothetically) be measured, but that quantity is not recorded. We need a modeling tool that reflects that there is some true, non-censored value, but that this value is not available to us.
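To make the censoring idea concrete, this is the likelihood of a standard left-censored (Tobit) Gaussian regression with the censoring point at 0; the covariates $x_i$, coefficients $\beta$, and error scale $\sigma$ are generic notation, not anything from OP's post. Each latent weight change $y_i^* \sim \mathcal{N}(x_i^\top\beta, \sigma^2)$ is observed exactly when positive and recorded as 0 ("no weight gain") otherwise:

$$
L(\beta,\sigma) \;=\; \prod_{i:\,y_i>0} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i^\top\beta}{\sigma}\right) \;\times\; \prod_{i:\,y_i=0} \Phi\!\left(\frac{-x_i^\top\beta}{\sigma}\right),
$$

where $\phi$ and $\Phi$ are the standard normal density and CDF. The censored observations contribute a probability mass rather than a density.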

One resource I happen to have on my bookshelf is Gelman et al., Bayesian Data Analysis (3rd edition). Censoring and truncation models are discussed starting on page 224. The authors write

Suppose an object is weighed 100 times on an electronic scale with a known measurement distribution $\mathcal{N}(\theta,1^2)$, where $\theta$ is the true weight of the object....

[T]he scale has an upper limit of 200 kg for reports: all values above 200kg are reported as "too heavy." The complete data are still $\mathcal{N}(\theta,1^2)$, but the observed data are censored; if we observe "too heavy," we know that it corresponds to a weighing with a reading above 200.

This is very similar to the problem stated by OP, except that the example is censored above 200 kg instead of below 0, and each object is weighed repeatedly with some instrument error.
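As a quick illustration of what such data look like, here is the scale example simulated in R (the "true" weight of 199.3 kg is made up for the sake of the example):

```r
## Simulate the BDA scale example: 100 weighings with N(theta, 1^2) error,
## censored above 200 kg ("too heavy").
set.seed(1)
theta    <- 199.3                               # hypothetical true weight (made up)
reading  <- rnorm(100, mean = theta, sd = 1)    # complete (uncensored) data
observed <- ifelse(reading > 200, NA, reading)  # "too heavy" readings recorded as NA
table(too_heavy = is.na(observed))              # how many weighings were censored
```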

One R package that seems relevant is censReg.

Arne Henningsen. "Estimating Censored Regression Models in R using the censReg Package"

We demonstrate how censored regression models (including standard Tobit models) can be estimated in R using the add-on package censReg. This package provides not only the usual maximum likelihood (ML) procedure for cross-sectional data but also the random-effects maximum likelihood procedure for panel data using Gauss-Hermite quadrature.
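Based on the package vignette, a minimal call would look something like the sketch below (untested; the data frame is simulated, and `y`/`x` stand in for your weight-gain outcome and whatever covariates you have):

```r
# install.packages("censReg")   # if needed
library(censReg)

## Simulated stand-in for the real data: latent weight change, recorded as 0 when <= 0
set.seed(42)
dat   <- data.frame(x = rnorm(500))                      # hypothetical covariate
dat$y <- pmax(2 + 1.5 * dat$x + rnorm(500, sd = 3), 0)   # censor at 0

## Left-censored (Tobit) Gaussian regression with the censoring point at 0
fit <- censReg(y ~ x, left = 0, data = dat)
summary(fit)
```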

I haven't used it, so I can't vouch for its quality or utility for this problem. There are probably lots of other options. The approach taken in Bayesian Data Analysis is to code up your own model, either in base R or in Stan. This gives the greatest degree of flexibility, at the cost of having to do the coding yourself.
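For the "code it yourself" route, here is a minimal base-R sketch (no Stan) that maximizes the left-censored Gaussian log-likelihood directly with `optim()`, using the same simulated data as in the `censReg` sketch above:

```r
## Same simulated data as in the censReg sketch
set.seed(42)
dat   <- data.frame(x = rnorm(500))
dat$y <- pmax(2 + 1.5 * dat$x + rnorm(500, sd = 3), 0)

## Negative log-likelihood of a Gaussian regression left-censored at 0
negloglik <- function(par, y, x) {
  beta0 <- par[1]; beta1 <- par[2]; sigma <- exp(par[3])   # log-sigma keeps sigma > 0
  mu   <- beta0 + beta1 * x
  cens <- y <= 0                                           # "no weight gain" records
  ll_obs  <- dnorm(y[!cens], mean = mu[!cens], sd = sigma, log = TRUE)
  ll_cens <- pnorm(0, mean = mu[cens], sd = sigma, log.p = TRUE)  # P(latent <= 0)
  -(sum(ll_obs) + sum(ll_cens))
}

fit <- optim(c(0, 0, 0), negloglik, y = dat$y, x = dat$x, hessian = TRUE)
c(intercept = fit$par[1], slope = fit$par[2], sigma = exp(fit$par[3]))
```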
