Solved – Multicollinearity in Zero Inflated Negative Binomial Regression

count-data, multicollinearity, negative-binomial-distribution, regression, zero-inflation

I am trying to model the count variable govt based on the counts lp, const, and opp, plus another independent variable, govtno. govt has many zeros, so I am using a zero-inflated negative binomial regression. The counts lp, const, and opp also have many zeros. The pairwise correlations between these counts may indicate multicollinearity among the predictors lp, const, and opp:

           govt     const        lp       opp
govt  1.0000000 0.2883734 0.4135134 0.3913364
const 0.2883734 1.0000000 0.4138627 0.5478605
lp    0.4135134 0.4138627 1.0000000 0.5315744
opp   0.3913364 0.5478605 0.5315744 1.0000000
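
(For reference, pairwise correlations like these can be computed directly from the data frame; the column selection below assumes the counts are stored under the names used in the model call.)

cor(dat1b.w.nc[, c("govt", "const", "lp", "opp")])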

1) How can I actually check for multicollinearity in this model? I do not know how to calculate VIFs for zero-inflated regression models.
2) How can I address multicollinearity if it is present? My final goal is to test the significance of the predictors, so any solution should still allow for statistical significance testing.

Here is the summary output of the zero-inflated negative binomial regression:

> summary(m4 <- zeroinfl(govt ~ govtno + const + lp + opp, data = dat1b.w.nc, dist = "negbin"))

Call:
zeroinfl(formula = govt ~ govtno + const + lp + opp, data = dat1b.w.nc, dist = "negbin")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.71953 -0.14796 -0.11066 -0.08794 15.45473 

Count model coefficients (negbin with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.123334   0.272388  -0.453   0.6507    
govtno      -0.013671   0.006024  -2.269   0.0232 *  
const        0.028129   0.015127   1.860   0.0630 .  
lp           0.024683   0.014829   1.665   0.0960 .  
opp          0.155652   0.036760   4.234 2.29e-05 ***
Log(theta)  -0.639797   0.137549  -4.651 3.30e-06 ***

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.943508   0.351314  11.225  < 2e-16 ***
govtno      -0.027054   0.008617  -3.139  0.00169 ** 
const       -0.052898   0.057112  -0.926  0.35433    
lp          -1.045437   0.187422  -5.578 2.43e-08 ***
opp         -1.881200   0.349475  -5.383 7.33e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 0.5274 
Number of iterations in BFGS optimization: 37 
Log-likelihood: -1422 on 11 Df

Best Answer

The variance inflation factor is a function of the predictor variables alone, independent of the outcome variable, as noted on this page. So if you know how to calculate VIFs for an ordinary linear model, you can do the same here.*
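To make that concrete, here is a minimal sketch: the VIF for predictor j is the j-th diagonal element of the inverse of the predictors' correlation matrix, so it can be computed without reference to the outcome or to the zero-inflated fit at all. The variable and data-frame names are taken from the model call in the question.

X <- dat1b.w.nc[, c("govtno", "const", "lp", "opp")]
diag(solve(cor(X)))  # VIF for each predictor; values near 1 indicate little collinearity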

Your zero-inflated model already provides p-values for statistical significance testing, and it shows that 3 of the 4 predictors are significantly related to the outcome regardless of any multicollinearity. It's not clear that you need to "address" multicollinearity any further from the hypothesis-testing standpoint. There presumably would, however, be some interest in analyzing and displaying the relations among your predictor variables.

The bigger problem here is that you are analyzing four predictors with fewer than 20 cases (based on the reported degrees of freedom). There is thus a severe danger of overfitting: finding a relation that works on this data set but does not generalize. That is probably more important to address than multicollinearity among those predictors.


*There is an alternative method for calculating VIFs in generalized linear models, implemented for example in the R car package. As shown here, that VIF calculation is based on the coefficient variance-covariance matrix produced during the maximum-likelihood fitting of the model. For a standard linear regression the VIF values are the same under either calculation, but the results can differ for generalized models.
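
For illustration, one hedged way to get those covariance-based (G)VIFs here: car::vif() does not handle zeroinfl objects directly, but fitting the count process as a plain negative binomial GLM with MASS::glm.nb() gives a comparable fit whose VIFs car can compute. Variable and data-frame names are again taken from the question.

library(MASS)  # glm.nb()
library(car)   # vif()

# Ordinary negative binomial GLM as a stand-in for the count part of the
# zero-inflated model; vif() works on glm-class fits such as this one.
m.nb <- glm.nb(govt ~ govtno + const + lp + opp, data = dat1b.w.nc)
vif(m.nb)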