Solved – Variable selection in zero-inflation models

count-datazero inflation

I am trying to understand how to perform model comparison between different count models. In this example the author performs a zero-inflated poisson model testing the effect of number of people in a group (variable group), number of children in a group (variable child), and whether or not the group had a camper (variable 'camper`)

library(pscl)
library(dplyr)
zinb <- read.csv("https://stats.idre.ucla.edu/stat/data/fish.csv") %>% transform(camper = factor(camper))

They perform the analysis using the following syntax

summary(m1 <- zeroinfl(count ~ child + camper | persons, data = zinb))

And get the output

# Call:
# zeroinfl(formula = count ~ child + camper | persons, data = zinb)

# Pearson residuals:
#     Min      1Q  Median      3Q     Max 
# -1.2369 -0.7540 -0.6080 -0.1921 24.0847 

# Count model coefficients (poisson with log link):
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)  1.59789    0.08554  18.680   <2e-16 ***
# child       -1.04284    0.09999 -10.430   <2e-16 ***
# camper1      0.83402    0.09363   8.908   <2e-16 ***

# Zero-inflation model coefficients (binomial with logit link):
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   1.2974     0.3739   3.470 0.000520 ***
# persons      -0.5643     0.1630  -3.463 0.000534 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

# Number of iterations in BFGS optimization: 12 
# Log-likelihood: -1032 on 5 Df

Question 1

What does the | persons section of the model specification do?

It obviously consigns the persons variable to the zero inflation portion of the output, presumably because they believe that variable contains (or is hiding) some of the information that contributes to the excess zeroes, namely that the reason why some people do not catch fish is because those people do not even try to fish.

The author then goes on to perform a standard poisson regression

summary(p1 <- glm(count ~ child + camper, family = poisson, data = zinb))

# Call:
#   glm(formula = count ~ child + camper, family = poisson, data = zinb)
# 
# Deviance Residuals: 
#   Min       1Q   Median       3Q      Max  
# -3.7736  -2.2293  -1.2024  -0.3498  24.9492  
# 
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
#   (Intercept)  0.91026    0.08119   11.21   <2e-16 ***
#   child       -1.23476    0.08029  -15.38   <2e-16 ***
#   camper1      1.05267    0.08871   11.87   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for poisson family taken to be 1)
# 
# Null deviance: 2958.4  on 249  degrees of freedom
# Residual deviance: 2380.1  on 247  degrees of freedom
# AIC: 2723.2
# 
# Number of Fisher Scoring iterations: 6

Note the omission of the |persons term in this model. There is no explanation of why the persons variable was ommitted. This is especially confusing as they then go on to compare the standard poisson to the zero-inflated poisson using Vuong's non-nested test

vuong(p1, m1)

Question 2

How can we compare the two models when the standard poisson omits the |persons from the model specification?

Question 3

Should all zero-inflated models split the model this way, by identifying a variable that might contain/conceal/approximate the reason for the excess zeroes?

Question 4

In the above example the variable that might identify the reason for the excess zeros (how many people in the group actually tried to catch a fish) is not available. But if you do have that variable available should you include it in the model?

Best Answer

You've got a lot of questions here.

1. what does | persons do?

As you suggest, it sets the zero-inflation model to $\textrm{logit}(p_z) = \beta_{z0} + \beta_{z1} \cdot \textrm{persons}$.

The authors decided that the number of people in the group could affect whether the group didn't fish at all, but wouldn't affect the fishing success if the group did fish.

In the corresponding Poisson model, the zero-inflation term is completely missing (i.e. $p_z=0$, which would correspond to $\beta_{z0} \to -\infty$ (and $\beta_{z1}$ irrelevant).

2. How can we compare models?

I don't see the problem here. We have clear differences in model complexity (if we want to use AIC), or clear nesting (conditions under which the Poisson model is a special case of the ZIP model): note there is some debate about using Vuong tests for ZIP vs Poisson comparisons ...

3. should all Z-I models use separate covariates for the conditional vs zero-inflation components?

No, not necessarily. In principle either component of the model could be anything from intercept-only to including all available covariates (this is the default in zeroinfl: e.g. zeroinfl(r ~ x + y + z, ...) assumes that all three covariates affect both parts of the model). The choice should be based on (1) specifying an appropriately complex model for the data and (2) "subject-area considerations": using knowledge of what the parameters mean and which components they might sensibly affect.

4. should 'number of active fishers' be included in the model?

I'm not sure. If you know when the number of fishers is zero (in which case there are never any fish caught) vs. non-zero, then you might not want to use a ZIP at all. On the other hand, it might be reasonable to include the number of fishers in the conditional mean model ($\textrm{log}(\textrm{fishers})$ as an offset or covariate)

Related Question