Solved – predicted counts for Zero-Inflated Poisson model differ from original samples

poisson-regression, python, statsmodels, zero-inflation

While experimenting with statsmodels' Zero-Inflated Poisson count model on artificially generated data, I noticed something odd. The fitted model successfully recovers the parameters used to generate the data. However, the distribution of predicted counts, for exogenous variable values generated in the same way, differs noticeably from the distribution of the original counts, even when the number of test samples matches the number used for fitting. Why is this the case? I'm using statsmodels 0.11.0 and Python 3.7.6.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.discrete.count_model as cm

np.random.seed(0)

# Two normally distributed regressors, used for both model parts
n = 100000
x0 = np.random.normal(6, 1, size=n)
x1 = np.random.normal(6, 1, size=n)
x = np.stack([x0, x1]).T

# Poisson part: counts with log-linear mean exp(x @ p_params)
p_params = np.array([0.2, -0.1])
p_part = np.random.poisson(np.exp(x @ p_params))
# Zero-inflation part: True where the Poisson count is kept, False for an excess zero
zi_params = np.array([0.3, -0.2])
zi_part = np.random.logistic(x @ zi_params) > 0

y = p_part*zi_part

# Random split into roughly equal train and test halves
mask = np.random.rand(n) <= 0.5
x_train = x[mask]
y_train = y[mask]
x_test = x[~mask]
y_test = y[~mask]

out = cm.ZeroInflatedPoisson(y_train, x_train, exog_infl=x_train)
res = out.fit()
y_test_pred = res.predict(x_test, exog_infl=x_test)
print(res.summary())
plt.clf()
plt.hist([y_test, y_test_pred], log=True, bins=max(y_test))
plt.legend(('orig','pred'))
plt.show()

Output:

                     ZeroInflatedPoisson Regression Results                    
===============================================================================
Dep. Variable:                       y   No. Observations:                49876
Model:             ZeroInflatedPoisson   Df Residuals:                    49874
Method:                            MLE   Df Model:                            1
Date:                 Wed, 19 Feb 2020   Pseudo R-squ.:                 0.03019
Time:                         08:47:05   Log-Likelihood:                -72939.
converged:                        True   LL-Null:                       -75209.
Covariance Type:             nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
inflate_x1    -0.3002      0.009    -34.492      0.000      -0.317      -0.283
inflate_x2     0.1984      0.009     22.874      0.000       0.181       0.215
x1             0.2051      0.003     65.802      0.000       0.199       0.211
x2            -0.1060      0.003    -31.216      0.000      -0.113      -0.099
==============================================================================

[Figure: log-scale histogram of original and predicted counts]

Best Answer

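The prediction that you are using is the expected value, i.e. the conditional expectation E(y | x).

This is the same as in standard count models like Poisson and in other models like GLM and linear models, where predict also returns the conditional expectation of the response variable. A histogram of these expected values will therefore not look like a histogram of observed counts: the expected values are continuous and contain none of the random variation of the counts around their mean.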

y_test_pred = res.predict(x_test, exog_infl=x_test)
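To see why a histogram of expected values cannot match a histogram of counts, it may help to compare the two directly. The following is a minimal NumPy-only sketch. The arrays lam and pi are made-up stand-ins for the per-observation Poisson mean and excess-zero probability; they are not outputs of the fitted statsmodels model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical per-observation quantities (assumptions for illustration only):
# lam = Poisson mean of the main model, pi = probability of an excess zero.
lam = np.exp(rng.normal(0.6, 0.2, size=n))
pi = 1.0 / (1.0 + np.exp(-rng.normal(-0.6, 0.3, size=n)))

# Conditional expectation of a zero-inflated Poisson: E[y | x] = (1 - pi) * lam.
mean_pred = (1.0 - pi) * lam

# Draws from the same distribution: a Poisson count, set to zero with probability pi.
y_sim = rng.poisson(lam) * (rng.random(n) >= pi)

print("average of expected values:", mean_pred.mean())
print("average of simulated counts:", y_sim.mean())
print("share of exact zeros among expected values:", np.mean(mean_pred == 0))
print("share of exact zeros among simulated counts:", np.mean(y_sim == 0))
print("largest expected value:", mean_pred.max())
print("largest simulated count:", y_sim.max())
```

The two averages agree closely, but the expected values are never exactly zero and have a much shorter right tail than the simulated counts. Comparing distributions therefore requires either simulated counts or predicted probabilities, not the predicted means.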

Here is a notebook that illustrates evaluating the predicted probabilities for different specifications of count models: https://gist.github.com/josef-pkt/c932904296270d75366a24ee92a4eb2f

The documentation is not very clear, but predict in ZeroInflated models has a which keyword option:

   which : str, optional
        Define values that will be predicted.
        'mean', 'mean-main', 'linear', 'mean-nonzero', 'prob-zero', 'prob', 'prob-main'
        Default is 'mean'.

These options give us the different parts of the predictive distribution for the combination of the zero-inflation model and the main Poisson model.
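As a rough guide to how these pieces relate, here is a standalone sketch that computes them by hand for a single observation. The helper zip_components is hypothetical and not part of statsmodels. The mapping of each option name to a formula is my reading of the names, assuming a log link for the Poisson part and a logit link for the inflation part; the linked source code is the authority on what predict actually returns.

```python
import math

def zip_components(lin_pred, lin_infl, max_count=30):
    """Hypothetical helper: pieces of a zero-inflated Poisson for one observation.

    lin_pred: linear predictor of the Poisson part (log link assumed)
    lin_infl: linear predictor of the inflation part (logit link assumed)
    """
    lam = math.exp(lin_pred)                        # mean of the main Poisson model
    prob_infl = 1.0 / (1.0 + math.exp(-lin_infl))   # probability of an excess zero
    prob_main = 1.0 - prob_infl                     # probability of the Poisson regime
    prob_zero = prob_infl + prob_main * math.exp(-lam)

    # Probability of each count 0..max_count under the mixture
    pmf = [prob_main * math.exp(-lam) * lam**k / math.factorial(k)
           for k in range(max_count + 1)]
    pmf[0] += prob_infl

    # Keys mirror the option names in the docstring (assumed correspondence)
    return {
        "linear": lin_pred,
        "mean-main": lam,
        "prob-main": prob_main,
        "prob-zero": prob_zero,
        "mean": prob_main * lam,
        "mean-nonzero": prob_main * lam / (1.0 - prob_zero),
        "prob": pmf,
    }

parts = zip_components(lin_pred=0.6, lin_infl=-0.6)
for name in ("mean-main", "prob-main", "prob-zero", "mean", "mean-nonzero"):
    print(name, round(parts[name], 4))
```

Under these assumptions, the overall mean equals the probability-weighted sum of the counts, and the first entry of the probability list equals the zero probability.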

The meaning of the different types of prediction is easier to see in the code: https://www.statsmodels.org/stable/_modules/statsmodels/discrete/count_model.html#GenericZeroInflated.predict
