Solved – Time series for count data, with counts < 20

count-dataepidemiologypoisson distributionrtime series

I recently started working for a tuberculosis clinic. We meet periodically to discuss the number of TB cases we're currently treating, the number of tests administered, etc. I'd like to start modeling these counts so that we're not just guessing whether something is unusual or not. Unfortunately, I've had very little training in time series, and most of my exposure has been to models for very continuous data (stock prices) or very large numbers of counts (influenza). But we deal with 0-18 cases per month (mean 6.68, median 7, var 12.3), which are distributed like this:

[image lost to the mists of time]

[image eaten by a grue]

I've found a few articles that address models like this, but I'd greatly appreciate hearing suggestions from you – both for approaches and for R packages that I could use to implement those approaches.

EDIT: mbq's answer has forced me to think more carefully about what I'm asking here; I got too hung-up on the monthly counts and lost the actual focus of the question. What I'd like to know is: does the (fairly visible) decline from, say, 2008 onward reflect a downward trend in the overall number of cases? It looks to me like the number of cases monthly from 2001-2007 reflects a stable process; maybe some seasonality, but overall stable. From 2008 through the present, it looks like that process is changing: the overall number of cases is declining, even though the monthly counts might wobble up and down due to randomness and seasonality. How can I test if there's a real change in the process? And if I can identify a decline, how could I use that trend and whatever seasonality there might be to estimate the number of cases we might see in the upcoming months?

Best Answer

To assess the historical trend, I'd use a gam with trend and seasonal components. For example

require(mgcv)
require(forecast)
x <- ts(rpois(100,1+sin(seq(0,3*pi,l=100))),f=12)
tt <- 1:100
season <- seasonaldummy(x)
fit <- gam(x ~ s(tt,k=5) + season, family="poisson")
plot(fit)

Then summary(fit) will give you a test of significance of the change in trend and the plot will give you some confidence intervals. The assumptions here are that the observations are independent and the conditional distribution is Poisson. Because the mean is allowed to change smoothly over time, these are not particularly strong assumptions.

To forecast is more difficult as you need to project the trend into the future. If you are willing to accept a linear extrapolation of the trend at the end of the data (which is certainly dodgy but probably ok for a few months), then use

fcast <- predict(fit,se.fit=TRUE,
               newdata=list(tt=101:112,season=seasonaldummyf(x,h=12)))

To see the forecasts on the same graph:

plot(x,xlim=c(0,10.5))
lines(ts(exp(fcast$fit),f=12,s=112/12),col=2)
lines(ts(exp(fcast$fit-2*fcast$se),f=12,s=112/12),col=2,lty=2)
lines(ts(exp(fcast$fit+2*fcast$se),f=12,s=112/12),col=2,lty=2)

You can spot the unusual months by looking for outliers in the (deviance) residuals of the fit.

Related Solutions

Solved – Time series analysis on login data to forecast CPU demand using R

Exponential smoothing is just a special case of an ARIMA model. If there is a benefit to fitting a general ARIMA model it is because of its generality and not that it handles gaps in the data any better than exponetial smoothing. I don't see any reason for throwing out February 29th. Individual dates would not have any appriciable effect on seasonality if there is some periodic component to the series. The time unit for time series analysis can be whatever time unit you measure the data in (it could be days weeks or years). You can cumulate data to create longer time intervals for the time series model. The fact that a date in one year falls on a different day of the week than in another year has nothing to do with its utility. If there are weekly effects this can show up in a 7 day periodic component. Gaps in the data does hurt your ability to fit the model. But if a single ARIMA model would have fit well to the complete series you probably can identify it piecing together the available portions of the series keeping count through the time index of the number of days missing at each gap. I don't understand why you can't have days as the time units. Is there a problem with having a long series because it seems to me that the time unit only affects the number of time points im the series?

Seasonality – Is This an Appropriate Method to Test for Seasonal Effects in Suicide Count Data?

What about a Poisson regression?

I created a data frame containing your data, plus an index t for the time (in months) and a variable monthdays for the number of days in each month.

T <- read.table("suicide.txt", header=TRUE)
U <- data.frame( year = as.numeric(rep(rownames(T),each=12)), 
         month = rep(colnames(T),nrow(T)), 
         t = seq(0, length = nrow(T)*ncol(T)), 
         suicides = as.vector(t(T)))
U$monthdays <- c(31,28,31,30,31,30,31,31,30,31,30,31)
U$monthdays[ !(U$year %% 4) & U$month == "Feb" ] <- 29

So it looks like this:

> head(U,14)
   year month  t suicides monthdays
1  1995   Jan  0       62        31
2  1995   Feb  1       47        28
3  1995   Mar  2       55        31
4  1995   Apr  3       74        30
5  1995   May  4       71        31
6  1995   Jun  5       70        30
7  1995   Jul  6       67        31
8  1995   Aug  7       69        31
9  1995   Sep  8       61        30
10 1995   Oct  9       76        31
11 1995   Nov 10       68        30
12 1995   Dec 11       68        31
13 1996   Jan 12       64        31
14 1996   Feb 13       69        29

Now let’s compare a model with a time effect and a number of days effect with a model in which we add a month effect:

> a0 <- glm( suicides ~ t + monthdays, family="poisson", data = U )
> a1 <- glm( suicides ~ t + monthdays + month, family="poisson", data = U )

Here is the summary of the "small" model:

> summary(a0)

Call:
glm(formula = suicides ~ t + monthdays, family = "poisson", data = U)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7163  -0.6865  -0.1161   0.6363   3.2104

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.8060135  0.3259116   8.610  < 2e-16 ***
t           0.0013650  0.0001443   9.461  < 2e-16 ***
monthdays   0.0418509  0.0106874   3.916 9.01e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 302.67  on 203  degrees of freedom
Residual deviance: 196.64  on 201  degrees of freedom
AIC: 1437.2

Number of Fisher Scoring iterations: 4

You can see that the two variables have largely significant marginal effects. Now look at the larger model:

> summary(a1)

Call:
glm(formula = suicides ~ t + monthdays + month, family = "poisson",
    data = U)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.56164  -0.72363  -0.05581   0.58897   3.09423

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.4559200  2.1586699   0.674    0.500
t            0.0013810  0.0001446   9.550   <2e-16 ***
monthdays    0.0869293  0.0719304   1.209    0.227
monthAug    -0.0845759  0.0832327  -1.016    0.310
monthDec    -0.1094669  0.0833577  -1.313    0.189
monthFeb     0.0657800  0.1331944   0.494    0.621
monthJan    -0.0372652  0.0830087  -0.449    0.653
monthJul    -0.0125179  0.0828694  -0.151    0.880
monthJun     0.0452746  0.0414287   1.093    0.274
monthMar    -0.0638177  0.0831378  -0.768    0.443
monthMay    -0.0146418  0.0828840  -0.177    0.860
monthNov    -0.0381897  0.0422365  -0.904    0.366
monthOct    -0.0463416  0.0830329  -0.558    0.577
monthSep     0.0070567  0.0417829   0.169    0.866
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 302.67  on 203  degrees of freedom
Residual deviance: 182.72  on 190  degrees of freedom
AIC: 1445.3

Number of Fisher Scoring iterations: 4

Well, of course the monthdays effect vanishes; it can be estimated only thanks to leap years!! Keeping it in the model (and taking into account leap years) allows to use the residual deviances to compare the two models.

> anova(a0, a1, test="Chisq")
Analysis of Deviance Table

Model 1: suicides ~ t + monthdays
Model 2: suicides ~ t + monthdays + month
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       201     196.65
2       190     182.72 11   13.928    0.237

So, no (significant) month effect? But what about a seasonal effect? We can try to capture seasonality using two variables $\cos\left( {2\pi t \over 12}\right)$ and $\sin\left( {2\pi t \over 12}\right)$:

> a2 <- glm( suicides ~ t + monthdays + cos(2*pi*t/12) + sin(2*pi*t/12),
             family="poisson", data = U )
> summary(a2)

Call:
glm(formula = suicides ~ t + monthdays + cos(2 * pi * t/12) +
    sin(2 * pi * t/12), family = "poisson", data = U)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4782  -0.7095  -0.0544   0.6471   3.2236

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)         2.8676170  0.3381954   8.479  < 2e-16 ***
t                   0.0013711  0.0001444   9.493  < 2e-16 ***
monthdays           0.0397990  0.0110877   3.589 0.000331 ***
cos(2 * pi * t/12) -0.0245589  0.0122658  -2.002 0.045261 *
sin(2 * pi * t/12)  0.0172967  0.0121591   1.423 0.154874
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 302.67  on 203  degrees of freedom
Residual deviance: 190.37  on 199  degrees of freedom
AIC: 1434.9

Number of Fisher Scoring iterations: 4

Now compare it with the null model:

> anova(a0, a2, test="Chisq")
Analysis of Deviance Table

Model 1: suicides ~ t + monthdays
Model 2: suicides ~ t + monthdays + cos(2 * pi * t/12) + sin(2 * pi *
    t/12)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       201     196.65                   
2       199     190.38  2   6.2698   0.0435 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So, one can surely say that this suggests a seasonal effect...

Best Answer

Related Solutions

Solved – Time series analysis on login data to forecast CPU demand using R

Seasonality – Is This an Appropriate Method to Test for Seasonal Effects in Suicide Count Data?

Related Question