Solved – predicted counts for Zero-Inflated Poisson model differ from original samples

Tags: poisson-regression, python, statsmodels, zero-inflation

While experimenting with statsmodels' Zero-Inflated Poisson count model using artificially generated data, I noticed that although the parameters used to generate the data for fitting were successfully recovered by the fitted model, the distribution of predicted counts for exogenous variable values generated in the same way appears to differ noticeably from the original counts (even when the total number of samples is the same as the number used for fitting). Any thoughts as to why this is the case? I'm using statsmodels 0.11.0 and Python 3.7.6.

import numpy as np
import matplotlib.pyplot as plt  # needed for the histogram at the end
import statsmodels.discrete.count_model as cm

np.random.seed(0)

n = 100000
x0 = np.random.normal(6, 1, size=n)
x1 = np.random.normal(6, 1, size=n)
x = np.stack([x0, x1]).T
poisson_part = np.zeros(n)
zi_part = np.zeros(n)

p_params = np.array([0.2, -0.1])
p_part = np.random.poisson(np.exp(x @ p_params))
zi_params = np.array([0.3, -0.2])
zi_part = np.random.logistic(x @ zi_params) > 0

y = p_part*zi_part

mask = np.random.rand(n) <= 0.5
x_train = x[mask]
y_train = y[mask]
x_test = x[~mask]
y_test = y[~mask]

out = cm.ZeroInflatedPoisson(y_train, x_train, exog_infl=x_train)
res = out.fit()
y_test_pred = res.predict(x_test, exog_infl=x_test)
print(res.summary())
plt.clf()
plt.hist([y_test, y_test_pred], log=True, bins=max(y_test))
plt.legend(('orig','pred'))
plt.show()

Output:

                     ZeroInflatedPoisson Regression Results                    
===============================================================================
Dep. Variable:                       y   No. Observations:                49876
Model:             ZeroInflatedPoisson   Df Residuals:                    49874
Method:                            MLE   Df Model:                            1
Date:                 Wed, 19 Feb 2020   Pseudo R-squ.:                 0.03019
Time:                         08:47:05   Log-Likelihood:                -72939.
converged:                        True   LL-Null:                       -75209.
Covariance Type:             nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
inflate_x1    -0.3002      0.009    -34.492      0.000      -0.317      -0.283
inflate_x2     0.1984      0.009     22.874      0.000       0.181       0.215
x1             0.2051      0.003     65.802      0.000       0.199       0.211
x2            -0.1060      0.003    -31.216      0.000      -0.113      -0.099
==============================================================================

Best Answer

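The prediction that you are using is the expected value, i.e. the conditional expectation E(y | x), not a simulated count. A histogram of these expected values is therefore not expected to look like a histogram of the observed counts.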

This is the same as in standard count models like Poisson and in other models like GLM and linear models, where predict also returns the conditional expectation of the response variable.

y_test_pred = res.predict(x_test, exog_infl=x_test)
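
To make the difference concrete, here is a minimal numpy-only sketch (not the statsmodels implementation). It assumes the question's true generating parameters stand in for the fitted ones, which the summary suggests is reasonable, and uses the standard ZIP mean (1 - pi) * lambda. The names `lam`, `pi_zero`, `mean_pred` and `y_sim` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Same design as the question: two normal regressors, no intercept.
x = rng.normal(6, 1, size=(n, 2))

# Assumption: true generating parameters used in place of fitted ones.
p_params = np.array([0.2, -0.1])     # Poisson (count) part
infl_params = np.array([-0.3, 0.2])  # logit of P(structural zero)

lam = np.exp(x @ p_params)                          # Poisson mean
pi_zero = 1.0 / (1.0 + np.exp(-(x @ infl_params)))  # P(structural zero)

# Conditional mean E(y | x): continuous and strictly positive.
mean_pred = (1.0 - pi_zero) * lam

# One random draw per observation from the ZIP distribution: integer counts.
y_sim = rng.poisson(lam) * (rng.random(n) >= pi_zero)

print("share of exact zeros among E(y|x):", np.mean(mean_pred == 0))
print("share of zeros among simulated y: ", np.mean(y_sim == 0))
print("average E(y|x):      ", mean_pred.mean())
print("average simulated y: ", y_sim.mean())
```

The two averages should agree closely, while only the simulated counts contain zeros, which is why the two histograms in the question have different shapes.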

Here is a notebook that illustrates evaluating the predicted probabilities for different specifications of count models: https://gist.github.com/josef-pkt/c932904296270d75366a24ee92a4eb2f

The documentation is not very clear, but predict in ZeroInflated models has a which keyword option:

   which : str, optional
        Define values that will be predicted.
        'mean', 'mean-main', 'linear', 'mean-nonzero', 'prob-zero', 'prob', 'prob-main'
        Default is 'mean'.

This gives us the different parts of the predictive distribution for the combination of the zero-inflation model and the main Poisson model.

The meaning of the different types of prediction is easier to see in the code: https://www.statsmodels.org/stable/_modules/statsmodels/discrete/count_model.html#GenericZeroInflated.predict
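
For a histogram that is comparable to the observed counts, one option is to sum per-observation probabilities over the sample to get an expected frequency for each count value. The sketch below does this by hand with numpy, using the textbook ZIP probability mass function rather than statsmodels itself. It again assumes the true generating parameters in place of fitted ones, and all variable names are illustrative. With a fitted model, the 'prob' option listed above is presumably the way to obtain such per-count probabilities.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(6, 1, size=(n, 2))

# Assumption: true generating parameters used in place of fitted ones.
p_params = np.array([0.2, -0.1])
infl_params = np.array([-0.3, 0.2])

lam = np.exp(x @ p_params)                          # Poisson mean
pi_zero = 1.0 / (1.0 + np.exp(-(x @ infl_params)))  # P(structural zero)
y = rng.poisson(lam) * (rng.random(n) >= pi_zero)   # simulated "observed" counts

# Poisson pmf for k = 0..max(y), one row per observation.
k = np.arange(y.max() + 1)
log_fact = np.array([math.lgamma(v + 1.0) for v in k])
pois_pmf = np.exp(k * np.log(lam)[:, None] - lam[:, None] - log_fact)

# ZIP pmf: scale the Poisson part, then add the structural zeros at k = 0.
zip_pmf = (1.0 - pi_zero)[:, None] * pois_pmf
zip_pmf[:, 0] += pi_zero

expected_freq = zip_pmf.sum(axis=0)               # expected number of obs per count
observed_freq = np.bincount(y, minlength=k.size)  # observed number of obs per count

for count, (obs, exp_) in enumerate(zip(observed_freq, expected_freq)):
    print(f"{count:2d}  observed={obs:6d}  expected={exp_:9.1f}")
```

Plotting `observed_freq` against `expected_freq` gives a like-for-like comparison, unlike a histogram of conditional means.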

Related Solutions

Solved – Zero-inflated Poisson regression

In the zero-inflated Poisson case, if $\mathbf{B}=\mathbf{G}$, then $\beta$ and $\lambda$ both have the same length, which is the number of columns of $\mathbf{B}$ or $\mathbf{G}$. So the number of parameters is twice the number of columns of the design matrix, i.e. twice the number of explanatory variables including the intercept (and whatever dummy coding was needed).

In a straight Poisson regression, there is no $\mathbf{p}$ vector to worry about and no need to estimate $\lambda$. So the number of parameters is just the length of $\beta$, i.e. half the number of parameters in the zero-inflated case.

Now, there's no particular reason why $\mathbf{B}$ has to equal $\mathbf{G}$, but generally it makes sense. However, one could imagine a data generating process where the chance of having any events at all is driven by one process, $\mathbf{G\lambda}$, and a completely different process, $\mathbf{B\beta}$, drives how many events there are, given non-zero events. As a contrived example, I pick classrooms based on their History exam scores to play some unrelated game, and then observe the number of goals they score. In this case $\mathbf{B}$ might be quite different from $\mathbf{G}$ (if the things driving History exam scores are different from those driving performance in the game) and $\beta$ and $\lambda$ could have different lengths. $\mathbf{G}$ might have more or fewer columns than $\mathbf{B}$. Either way, the zero-inflated Poisson model will have more parameters than a simple Poisson model.

In common practice I think $\mathbf{G} = \mathbf{B}$ most of the time.
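
The parameter counting above can be sketched in a few lines of Python. The helper names `n_params_poisson` and `n_params_zip` and the toy design matrices are hypothetical, introduced only for illustration.

```python
import numpy as np

def n_params_poisson(B):
    """Plain Poisson regression: one coefficient per column of B."""
    return B.shape[1]

def n_params_zip(B, G):
    """Zero-inflated Poisson: coefficients for the count part (B) plus the zero part (G)."""
    return B.shape[1] + G.shape[1]

n = 10
# Toy design matrix with an intercept and two regressors.
B = np.column_stack([np.ones(n), np.arange(n), np.arange(n) ** 2])
G_same = B           # same design for both parts
G_small = B[:, :2]   # zero part uses fewer columns

print(n_params_poisson(B))       # count part only
print(n_params_zip(B, G_same))   # twice the Poisson count when G equals B
print(n_params_zip(B, G_small))  # still more than plain Poisson
```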

Zero-Inflated Poisson Model – Comprehensive Understanding

The criterion is based on (informed) model comparisons. You are trying to account for over-dispersion.

Poisson: var(x) ~ mu

Negative binomial: var(x) > mu

With "extra" zeros:

ZIP: var(x) ~ mu in the count component

ZINB: var(x) > mu in the count component

One actively maintained package that you can use is pscl, installed with install.packages("pscl"). You can then fit a number of models, such as a hurdle model that uses a negative binomial for the counts and a binomial model for the probability of zeros. This would be written something like:
```
library(pscl)

# Hurdle model: negative binomial for the counts, binomial for zero vs. non-zero
fit <- hurdle(Admissions ~ Temperature + Humidity, dist = "negbin", data = data)

summary(fit)
```

Note that the output will have two sets of coefficients: one for the hurdle component and one for the count component. The output also provides an estimate of the theta (overdispersion) parameter of the negative binomial.

Or you may want to look at the zero-inflation model:

fit1 <- zeroinfl(Admissions ~ Temperature + Humidity, data = data, dist = "negbin", link = "logit")

These models can be compared using AIC (also compare them with your Poisson model): AIC(fit, fit1)
