Solved – Implementing a hurdle/Zero-inflated Poisson model in R with right-censored count data

tobit-regressionzero inflation

I'm trying to fit a hurdle/zero-inflated (I haven't decided yet) model on microbiological water quality data that is also right-censored: either the water sample is contaminated with bacteria or not, and if contaminated, the number of colonies can go from 1 to 99 and "more than 100" (because it was not possible to count the number of colonies beyond 100 – often referred in microbiology as "TNTC = too numerous to count").

If I refer to the data example taken by Kleiber and Zeileis (https://www.statistik.uni-dortmund.de/useR-2008/slides/Kleiber+Zeileis.pdf) I'm in a situation where the "number of visits to the physician" is censored to let's say "30 visits or more".

Is there a way to combine a hurdle or zero-inflated model (from the package pscl for example) with this right censored data distribution ? A sort or combined hurdle / ZIP + tobit model ??

Thanks a lot for your help
Lily

Best Answer

First, don't forget that a large number of 0 values does not necessarily require a zero-inflated or hurdle model. If you have a similarly large number of cases whose predictor-variable values would be associated with low probability of contamination then your data might not be zero-inflated at all. For deciding between zero-inflated and hurdle models, don't miss this thread.

Second, if you do choose to use a hurdle model for the zeros then I understand that these are often fit in a two-step process: first the zero/positive dichotomous model and then the positive counts. In that case, after the zero/positive fit you would only have to deal with the right-censoring beyond 100 colonies in the second step, separately.

Third, the VGAM package in R includes a cens.poisson family function for censored count data, either or both right- or left-censored. Its handling of censoring is based on the survival package, so the outcome variable has to be a survival object with both the counts and a censoring indicator. I haven't used it and there may be some tricks in formatting data to take advantage of this defined family function; examine its help page for examples of how to use it properly.

Finally, you could do your own maximization of the likelihood function for a zero-inflated Poisson model with right censoring. I happened to find the formula in this paper. The maxLik package provides tools for such a purpose, although I haven't used it myself.

Related Solutions

Solved – Enormous SEs in zero-inflated negative binomial regression

(Hope this is an appropriate way to answer my own question; maybe this should be a comment?)

I figured out the problem thanks in part to a fantastic answer on count regression diagnostics. The offset was the culprit causing the huge SEs, and the residuals and other goodness-of-fit diagnostics were accordingly strange. The offset term may have been a problem because one of the countries (China) had a huge population but comparatively few cases.

For comparison, here is the ZINB without the offset:

enter image description here

Hope this helps anyone else experiencing similar problems.

Solved – Trouble finding good model fit for count data with mixed effects – ZINB or something else

This post has four years, but I wanted to follow on what fsociety said in a comment. Diagnosis of residuals in GLMMs is not straightforward, since standard residual plots can show non-normality, heteroscedasticity, etc., even if the model is correctly specified. There is an R package, DHARMa, specifically suited for diagnosing these type of models.

The package is based on a simulation approach to generate scaled residuals from fitted generalized linear mixed models and generates different easily interpretable diagnostic plots. Here is a small example with the data from the original post and the first fitted model (m1):

library(DHARMa)
sim_residuals <- simulateResiduals(m1, 1000)
plotSimulatedResiduals(sim_residuals)

The plot on the left shows a QQ plot of the scaled residuals to detect deviations from the expected distribution, and the plot on the right represents residuals vs predicted values while performing quantile regression to detect deviations from uniformity (red lines should be horizontal and at 0.25, 0.50 and 0.75).

Additionally, the package has also specific functions for testing for over/under dispersion and zero inflation, among others:

testOverdispersionParametric(m1)

Chisq test for overdispersion in GLMMs

data:  poisson
dispersion = 0.18926, pearSS = 11.35600, rdf = 60.00000, p-value = 1
alternative hypothesis: true dispersion greater 1

testZeroInflation(sim_residuals)

DHARMa zero-inflation test via comparison to expected zeros with 
simulation under H0 = fitted model


data:  sim_residuals
ratioObsExp = 0.98894, p-value = 0.502
alternative hypothesis: more

Best Answer

Related Solutions

Solved – Enormous SEs in zero-inflated negative binomial regression

Solved – Trouble finding good model fit for count data with mixed effects – ZINB or something else

Related Question