Solved – Poisson regression with (auto-correlated) time series

box-jenkins, count-data, poisson-regression, regression, time series

I have a time series dataset which shows, for each day, the number of complaints received by an organization about a particular problem. I also have a number of other time series for the same period (mostly environmental variables like weather, chemistry etc.) which may help to explain the pattern of complaints.

My response variable is therefore discrete (number of complaints), while my possible explanatory variables are all continuous. Based on the advice of a colleague and a bit of Googling, it seems that some kind of Poisson regression model should be appropriate here. However, I'm running into a few difficulties:

  1. The variance of my count data (0.76) is greater than the mean (0.33). I think this indicates that my response variable is "over-dispersed" and that Poisson regression is therefore inappropriate?

  2. My count data is overwhelmingly dominated by zeros (i.e. most days have no complaints). It seems that this may be a problem for Poisson regression?

  3. Virtually all of my time series (both the response and explanatory variables) are auto-correlated.

  4. Some of my explanatory variables are correlated with each other. For example, stream flow is closely related to rainfall etc.
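For points 1 and 2, a quick sanity check is to compare the variance-to-mean ratio with 1, and the observed share of zeros with the share a Poisson distribution with the same mean would predict, exp(-mean). With the figures quoted above, 0.76 / 0.33 is roughly 2.3, and exp(-0.33) is roughly 0.72. A minimal Python sketch, where the `daily_complaints` values are made up for illustration and `dispersion_summary` is a hypothetical helper name:

```python
import math
from statistics import mean, variance


def dispersion_summary(counts):
    """Summarise over-dispersion and excess zeros relative to a Poisson model."""
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    return {
        "mean": m,
        "variance": v,
        # A ratio well above 1 suggests over-dispersion relative to Poisson.
        "dispersion_ratio": v / m,
        "observed_zero_share": sum(c == 0 for c in counts) / len(counts),
        # Share of zeros a Poisson with the same mean would produce.
        "poisson_zero_share": math.exp(-m),
    }


# Hypothetical daily complaint counts, for illustration only.
daily_complaints = [0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0]
summary = dispersion_summary(daily_complaints)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

If the observed share of zeros is far above the Poisson share, that points towards zero-inflated or negative binomial alternatives rather than a plain Poisson model.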

If I've understood what I've been reading correctly, all of this suggests that I'll need to do something a bit more involved than a "standard" Poisson regression.

My question: What is the simplest way to meaningfully analyze this kind of data, please? What are my options and what techniques should I be researching? I've been reading a little bit about Box-Jenkins models for count data (?), but I'm getting out of my depth. Is there anything more straightforward that's still rigorous?

The ultimate aim is to come up with some suggestions as to which factors might be causing the problem. If I could end up saying something like, "it appears that explanatory variables x, y and z are significantly correlated with the number of complaints" or "no combination of the explanatory variables can adequately explain the pattern of complaints", that would be useful.

Best Answer

I had a similar problem and was told to consult Chapter 4 of Regression Models for Time Series Analysis by Benjamin Kedem and Konstantinos Fokianos. I have not yet gotten around to digesting this book, but it looks highly relevant (though fairly technical) as far as I can tell.

I also wonder if this can be handled in a GLM framework with a Poisson family, a log link function, and Newey-West standard errors. This is one line of code in Stata (after you tsset your data) and is probably fairly doable in other packages. Here's a link to an old Stata Technical Bulletin article by James Hardin with the variance formulas for the probit, logit, and Poisson models. Perhaps one of the time-series mavens can comment on whether this would be a terrible idea.
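To make the GLM idea concrete outside Stata, here is a rough pure-Python sketch: a Poisson log-link model fitted by Newton's method, with a Newey-West (Bartlett kernel) sandwich covariance built from the per-day score contributions. Everything named here is hypothetical and made up for illustration (`solve`, `poisson_newey_west`, the `rain` and `complaints` values, the choice `max_lag=3`). There is no small-sample correction, so the standard errors will not necessarily match what Stata or another package reports.

```python
import math


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination (a: n x n, b: n x m, lists of rows)."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def poisson_newey_west(X, y, max_lag, n_iter=50, tol=1e-10):
    """Poisson regression (log link) with Newey-West standard errors.

    X is a list of rows (include a column of 1.0 for the intercept),
    y is a list of counts in time order. Returns (coefficients, std_errors).
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * k

    def mean_and_hessian(b):
        mu = [math.exp(sum(bj * xj for bj, xj in zip(b, row))) for row in X]
        hess = [[sum(m * row[i] * row[j] for row, m in zip(X, mu))
                 for j in range(k)] for i in range(k)]
        return mu, hess

    # Newton-Raphson on the Poisson log-likelihood.
    for _ in range(n_iter):
        mu, hess = mean_and_hessian(beta)
        grad = [[sum((yt - m) * row[j] for row, yt, m in zip(X, y, mu))]
                for j in range(k)]
        step = [s[0] for s in solve(hess, grad)]
        beta = [bj + sj for bj, sj in zip(beta, step)]
        if max(abs(sj) for sj in step) < tol:
            break

    mu, hess = mean_and_hessian(beta)
    # Per-observation score contributions: (y_t - mu_t) * x_t.
    scores = [[(yt - m) * xj for xj in row] for row, yt, m in zip(X, y, mu)]

    # "Meat": lag-0 outer products plus Bartlett-weighted autocovariances.
    meat = [[0.0] * k for _ in range(k)]
    for lag in range(max_lag + 1):
        w = 1.0 - lag / (max_lag + 1)
        for t in range(lag, n):
            for i in range(k):
                for j in range(k):
                    g = w * scores[t][i] * scores[t - lag][j]
                    meat[i][j] += g
                    if lag > 0:
                        meat[j][i] += g  # add the transposed term for lag > 0

    # Sandwich: H^-1 * meat * H^-1 (both matrices are symmetric).
    half = solve(hess, meat)
    half_t = [list(col) for col in zip(*half)]
    cov = solve(hess, half_t)
    se = [math.sqrt(cov[j][j]) for j in range(k)]
    return beta, se


# Hypothetical 30 days of rainfall and complaint counts, for illustration only.
rain = [0.0, 0.2, 1.5, 2.0, 0.5, 0.0, 0.0, 0.1, 0.0, 1.0,
        2.5, 3.0, 1.2, 0.3, 0.0, 0.0, 0.0, 0.4, 1.8, 2.2,
        0.6, 0.0, 0.0, 0.0, 0.9, 2.8, 1.1, 0.2, 0.0, 0.0]
complaints = [0, 0, 1, 2, 0, 0, 0, 0, 0, 0,
              1, 3, 1, 0, 0, 0, 0, 0, 1, 1,
              0, 0, 0, 0, 0, 2, 1, 0, 0, 0]
X = [[1.0, r] for r in rain]

beta, se = poisson_newey_west(X, complaints, max_lag=3)
for name, b, s in zip(["intercept", "rain"], beta, se):
    print(f"{name}: coef={b:.3f}, NW se={s:.3f}, z={b / s:.2f}")
```

Only the standard errors differ from a plain Poisson fit; the coefficients are the usual maximum-likelihood ones. The lag length is a judgment call, and in practice a tested library routine would be preferable to hand-rolled linear algebra like this.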