I have data where the dependent variable is a count of events. I am modeling the relationship between the dependent and independent variables with a negative binomial model, but I would also like to try some machine learning or nonparametric models, specifically ones that can handle nonlinear relationships between the response and the predictors. Any suggestions? Thanks.
Solved – Nonparametric methods for count data
count-data nonparametric
Related Solutions
The variance inflation factor is a function of the predictor variables, independent of the outcome variable, as noted on this page. So if you know how to calculate VIF for any type of linear model, you can do the same here.*
Your zero-inflated model analysis provides values for statistical significance testing, showing that 3 of 4 predictors are significantly related to the outcome variable, regardless of any multicollinearity. It's not clear that you need to "address" further any multicollinearity from the hypothesis testing standpoint. There presumably would, however, be some interest in analyzing and displaying the relations among your predictor variables.
The bigger problem here is that you are analyzing four predictors with fewer than 20 cases (based on the reported degrees of freedom). Thus there is a severe danger of overfitting: finding a relation that works on this data set but does not generalize well. That's probably more important to address than multicollinearity among those predictors.
*There is an alternate method for calculating VIFs in generalized linear models, implemented for example in the R car package. As shown here, that VIF calculation is based on the coefficient variance-covariance matrix produced during the maximum-likelihood fitting of the model. For a standard linear regression the VIF values would be the same with either calculation, but the results can differ for generalized models.
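The predictor-only calculation described above (not the coefficient-covariance variant from the footnote) can be sketched in plain Python. This is a minimal illustration, assuming the predictors are given as equal-length lists; the function names are made up for this example. It relies on the standard identity that VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal entry of the inverse of the predictors' correlation matrix, so the outcome variable never enters.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length predictor columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    p = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(p)
        ]
        for i in range(p)
    ]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    p = len(matrix)
    # Augment with the identity; the right half becomes the inverse.
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(p)]
        for i, row in enumerate(matrix)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def vif(columns):
    """Variance inflation factors, one per predictor column."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


if __name__ == "__main__":
    # Illustrative predictor values only; no outcome variable is needed.
    print(vif([[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 6.0]]))
```

With only two predictors both VIFs coincide, since each equals 1 / (1 - r^2) for the single pairwise correlation r.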
If you have only some zeros and frequent high counts, then you may be able to use models for continuous data, even though your data are counts. For instance, consider ARIMAX or regression with ARIMA errors (the two are not the same).
The question "Best approach for count prediction in time-series?" is related. However, I essentially write there that I have not seen any count data models that include autoregression or similar "TS" structure. You may be able to "roll your own" negative binomial regression model with a mean that depends both on predictors and autoregressively on previous values. Negative Binomial Regression by Hilbe might be useful, but he doesn't consider time series models.
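As a rough idea of what such a "roll your own" model could look like, here is a sketch of the negative log-likelihood only. Everything specific in it is an assumption made for illustration, not something from the references above: a single predictor x, a log link, the lagged term log(1 + y[t-1]), and the parameter names b0, b1, rho, and r (the negative binomial size parameter, with variance mu + mu^2 / r).

```python
import math


def nb_logpmf(y, mu, r):
    """Log pmf of a negative binomial with mean mu and size r."""
    return (
        math.lgamma(y + r)
        - math.lgamma(r)
        - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def conditional_means(params, y, x):
    """Means for t = 1..n-1 from the predictor and the previous count.

    Assumed form: mu_t = exp(b0 + b1 * x_t + rho * log(1 + y_{t-1})).
    """
    b0, b1, rho = params
    return [
        math.exp(b0 + b1 * x[t] + rho * math.log(1 + y[t - 1]))
        for t in range(1, len(y))
    ]


def neg_loglik(params, r, y, x):
    """Negative log-likelihood, conditioning on the first observation."""
    means = conditional_means(params, y, x)
    return -sum(nb_logpmf(y[t + 1], mu, r) for t, mu in enumerate(means))


if __name__ == "__main__":
    # Illustrative series only.
    counts = [2, 0, 3, 1, 4, 2]
    predictor = [0.1, 0.4, 0.2, 0.9, 0.5, 0.3]
    print(neg_loglik((0.2, 0.5, 0.3), 1.5, counts, predictor))
```

To actually fit the model, this function would be handed to a general-purpose numerical optimizer, with r constrained to be positive (for example by optimizing over log r).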
Shameless self-promotion: I wrote a little article (Kolassa, 2016, IJF) on forecasting sales count data with regressors which may be useful. However, for all the reasons above, I didn't model any autoregression. Then again, it's probably not necessary for my application, which is supermarkets and drug stores that serve a couple thousand households - even if each household has an autoregressive demand for toilet paper, I would expect the autoregression to disappear in the aggregate data generating process. Your problem may be similar, or different.
Finally, if you are unsatisfied with the quality of your predictions, "How to know that your machine learning problem is hopeless?" may be helpful.
Best Answer
Cameron and Trivedi (2013), Chapter 11.6, has a section on nonparametric methods for count data. I haven't used nonparametric methods for count data so far, but it seems that most standard nonparametric methods, such as kernel methods, nearest neighbor, or spline regression, are also available for count data.
You can also find some information in the np package for R; see Chapter 4.
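To give a feel for the kernel approach mentioned above, here is a minimal sketch of a Nadaraya-Watson estimate of the conditional mean count with one predictor. It is a generic illustration, not the method from Cameron and Trivedi or the np package; the Gaussian kernel, the fixed bandwidth, and the function name are assumptions made for this example.

```python
import math


def kernel_mean(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x0 is too far from the data for this bandwidth")
    # A weighted average of observed counts, so it can never go negative.
    return sum(w * yi for w, yi in zip(weights, ys)) / total


if __name__ == "__main__":
    # Illustrative data only: counts that grow nonlinearly in x.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0, 1, 4, 9, 15]
    for x0 in (0.5, 2.0, 3.5):
        print(x0, kernel_mean(x0, xs, ys, bandwidth=0.75))
```

Because the estimate is a weighted average of the observed counts, it respects the non-negativity of the response without any link function. In practice the bandwidth would be chosen by cross-validation rather than fixed by hand.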
If you are mostly interested in getting consistent coefficients, then your results might be quite robust to the detailed functional form as long as the conditional mean is correctly specified (see quasi-maximum likelihood methods, again in Cameron and Trivedi).
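The quasi-maximum likelihood idea can be illustrated with the Poisson score, which involves the data only through y - mu. The sketch below assumes a single predictor and a log-linear mean exp(b0 + b1 * x); the function name and the toy data are made up for illustration. The responses in the example are not even integers, yet the estimating equations still recover the mean parameters because the mean is correctly specified.

```python
import math


def poisson_qmle(x, y, iterations=25):
    """Newton-Raphson on the Poisson score for the mean exp(b0 + b1 * x)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score: sum of (y - mu) * (1, x).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian: sum of mu * (1, x)(1, x)^T, a 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


if __name__ == "__main__":
    # Illustrative data: responses set exactly to the assumed mean.
    x = [0.0, 1.0, 2.0, 3.0, 4.0]
    y = [math.exp(0.5 + 0.3 * xi) for xi in x]
    print(poisson_qmle(x, y))
```

Standard errors are a separate matter: under overdispersion the usual Poisson standard errors are not valid, and a robust (sandwich) variance estimate would be needed.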
But probably the easiest way to check whether you need nonparametric methods is to use quantile regression methods for count data, which seem to be available for Stata.