I am trying to understand how to perform model comparison between different count models. In this example the author performs a zero-inflated poisson model testing the effect of number of people in a group (variable group
), number of children in a group (variable child
), and whether or not the group had a camper (variable 'camper`)
library(pscl)
library(dplyr)
zinb <- read.csv("https://stats.idre.ucla.edu/stat/data/fish.csv") %>% transform(camper = factor(camper))
They perform the analysis using the following syntax
summary(m1 <- zeroinfl(count ~ child + camper | persons, data = zinb))
And get the output
# Call:
# zeroinfl(formula = count ~ child + camper | persons, data = zinb)
# Pearson residuals:
# Min 1Q Median 3Q Max
# -1.2369 -0.7540 -0.6080 -0.1921 24.0847
# Count model coefficients (poisson with log link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 1.59789 0.08554 18.680 <2e-16 ***
# child -1.04284 0.09999 -10.430 <2e-16 ***
# camper1 0.83402 0.09363 8.908 <2e-16 ***
# Zero-inflation model coefficients (binomial with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 1.2974 0.3739 3.470 0.000520 ***
# persons -0.5643 0.1630 -3.463 0.000534 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Number of iterations in BFGS optimization: 12
# Log-likelihood: -1032 on 5 Df
Question 1
What does the | persons
section of the model specification do?
It obviously consigns the persons
variable to the zero inflation portion of the output, presumably because they believe that variable contains (or is hiding) some of the information that contributes to the excess zeroes, namely that the reason why some people do not catch fish is because those people do not even try to fish.
The author then goes on to perform a standard poisson regression
summary(p1 <- glm(count ~ child + camper, family = poisson, data = zinb))
# Call:
# glm(formula = count ~ child + camper, family = poisson, data = zinb)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -3.7736 -2.2293 -1.2024 -0.3498 24.9492
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.91026 0.08119 11.21 <2e-16 ***
# child -1.23476 0.08029 -15.38 <2e-16 ***
# camper1 1.05267 0.08871 11.87 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for poisson family taken to be 1)
#
# Null deviance: 2958.4 on 249 degrees of freedom
# Residual deviance: 2380.1 on 247 degrees of freedom
# AIC: 2723.2
#
# Number of Fisher Scoring iterations: 6
Note the omission of the |persons
term in this model. There is no explanation of why the persons
variable was ommitted. This is especially confusing as they then go on to compare the standard poisson to the zero-inflated poisson using Vuong's non-nested test
vuong(p1, m1)
Question 2
How can we compare the two models when the standard poisson omits the |persons
from the model specification?
Question 3
Should all zero-inflated models split the model this way, by identifying a variable that might contain/conceal/approximate the reason for the excess zeroes?
Question 4
In the above example the variable that might identify the reason for the excess zeros (how many people in the group actually tried to catch a fish) is not available. But if you do have that variable available should you include it in the model?
Best Answer
You've got a lot of questions here.
1. what does
| persons
do?As you suggest, it sets the zero-inflation model to $\textrm{logit}(p_z) = \beta_{z0} + \beta_{z1} \cdot \textrm{persons}$.
The authors decided that the number of people in the group could affect whether the group didn't fish at all, but wouldn't affect the fishing success if the group did fish.
In the corresponding Poisson model, the zero-inflation term is completely missing (i.e. $p_z=0$, which would correspond to $\beta_{z0} \to -\infty$ (and $\beta_{z1}$ irrelevant).
2. How can we compare models?
I don't see the problem here. We have clear differences in model complexity (if we want to use AIC), or clear nesting (conditions under which the Poisson model is a special case of the ZIP model): note there is some debate about using Vuong tests for ZIP vs Poisson comparisons ...
3. should all Z-I models use separate covariates for the conditional vs zero-inflation components?
No, not necessarily. In principle either component of the model could be anything from intercept-only to including all available covariates (this is the default in
zeroinfl
: e.g.zeroinfl(r ~ x + y + z, ...)
assumes that all three covariates affect both parts of the model). The choice should be based on (1) specifying an appropriately complex model for the data and (2) "subject-area considerations": using knowledge of what the parameters mean and which components they might sensibly affect.4. should 'number of active fishers' be included in the model?
I'm not sure. If you know when the number of fishers is zero (in which case there are never any fish caught) vs. non-zero, then you might not want to use a ZIP at all. On the other hand, it might be reasonable to include the number of fishers in the conditional mean model ($\textrm{log}(\textrm{fishers})$ as an offset or covariate)