What about a Poisson regression?
I created a data frame containing your data, plus an index t
for the time (in months) and a variable monthdays
for the number of days in each month.
T <- read.table("suicide.txt", header = TRUE)   # one row per year, one column per month
U <- data.frame( year     = as.numeric(rep(rownames(T), each = 12)),
                 month    = rep(colnames(T), nrow(T)),
                 t        = seq(0, length.out = nrow(T) * ncol(T)),   # month index, starting at 0
                 suicides = as.vector(t(T)) )
U$monthdays <- c(31,28,31,30,31,30,31,31,30,31,30,31)        # recycled over the years
U$monthdays[ !(U$year %% 4) & U$month == "Feb" ] <- 29        # leap-year Februaries
So it looks like this:
> head(U,14)
year month t suicides monthdays
1 1995 Jan 0 62 31
2 1995 Feb 1 47 28
3 1995 Mar 2 55 31
4 1995 Apr 3 74 30
5 1995 May 4 71 31
6 1995 Jun 5 70 30
7 1995 Jul 6 67 31
8 1995 Aug 7 69 31
9 1995 Sep 8 61 30
10 1995 Oct 9 76 31
11 1995 Nov 10 68 30
12 1995 Dec 11 68 31
13 1996 Jan 12 64 31
14 1996 Feb 13 69 29
Now let’s compare a model with a time effect and a number-of-days effect against a model in which we also add a month effect:
> a0 <- glm( suicides ~ t + monthdays, family="poisson", data = U )
> a1 <- glm( suicides ~ t + monthdays + month, family="poisson", data = U )
Here is the summary of the "small" model:
> summary(a0)
Call:
glm(formula = suicides ~ t + monthdays, family = "poisson", data = U)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7163 -0.6865 -0.1161 0.6363 3.2104
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.8060135 0.3259116 8.610 < 2e-16 ***
t 0.0013650 0.0001443 9.461 < 2e-16 ***
monthdays 0.0418509 0.0106874 3.916 9.01e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 302.67 on 203 degrees of freedom
Residual deviance: 196.64 on 201 degrees of freedom
AIC: 1437.2
Number of Fisher Scoring iterations: 4
You can see that both variables have highly significant marginal effects. Now look at the larger model:
> summary(a1)
Call:
glm(formula = suicides ~ t + monthdays + month, family = "poisson",
data = U)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.56164 -0.72363 -0.05581 0.58897 3.09423
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4559200 2.1586699 0.674 0.500
t 0.0013810 0.0001446 9.550 <2e-16 ***
monthdays 0.0869293 0.0719304 1.209 0.227
monthAug -0.0845759 0.0832327 -1.016 0.310
monthDec -0.1094669 0.0833577 -1.313 0.189
monthFeb 0.0657800 0.1331944 0.494 0.621
monthJan -0.0372652 0.0830087 -0.449 0.653
monthJul -0.0125179 0.0828694 -0.151 0.880
monthJun 0.0452746 0.0414287 1.093 0.274
monthMar -0.0638177 0.0831378 -0.768 0.443
monthMay -0.0146418 0.0828840 -0.177 0.860
monthNov -0.0381897 0.0422365 -0.904 0.366
monthOct -0.0463416 0.0830329 -0.558 0.577
monthSep 0.0070567 0.0417829 0.169 0.866
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 302.67 on 203 degrees of freedom
Residual deviance: 182.72 on 190 degrees of freedom
AIC: 1445.3
Number of Fisher Scoring iterations: 4
Well, of course the monthdays effect vanishes: once month is in the model, the number of days is determined by the month except in leap-year Februaries, so this effect can only be estimated thanks to leap years! Keeping it in the model (and taking leap years into account) allows us to use the residual deviances to compare the two models.
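You can check this near-collinearity directly; a quick cross-tabulation of the data frame U built above shows that, February aside, each month always has the same number of days:
## Apart from February in leap years, every month has a single possible value
## of monthdays, so once "month" is in the model, "monthdays" adds almost no
## additional information.
with(U, table(month, monthdays))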
> anova(a0, a1, test="Chisq")
Analysis of Deviance Table
Model 1: suicides ~ t + monthdays
Model 2: suicides ~ t + monthdays + month
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 201 196.65
2 190 182.72 11 13.928 0.237
So, no (significant) month effect? But what about a seasonal effect? We can try to capture seasonality using two variables $\cos\left( {2\pi t \over 12}\right)$ and $\sin\left( {2\pi t \over 12}\right)$:
> a2 <- glm( suicides ~ t + monthdays + cos(2*pi*t/12) + sin(2*pi*t/12),
family="poisson", data = U )
> summary(a2)
Call:
glm(formula = suicides ~ t + monthdays + cos(2 * pi * t/12) +
sin(2 * pi * t/12), family = "poisson", data = U)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4782 -0.7095 -0.0544 0.6471 3.2236
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.8676170 0.3381954 8.479 < 2e-16 ***
t 0.0013711 0.0001444 9.493 < 2e-16 ***
monthdays 0.0397990 0.0110877 3.589 0.000331 ***
cos(2 * pi * t/12) -0.0245589 0.0122658 -2.002 0.045261 *
sin(2 * pi * t/12) 0.0172967 0.0121591 1.423 0.154874
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 302.67 on 203 degrees of freedom
Residual deviance: 190.37 on 199 degrees of freedom
AIC: 1434.9
Number of Fisher Scoring iterations: 4
Now compare it with the smaller model a0:
> anova(a0, a2, test="Chisq")
Analysis of Deviance Table
Model 1: suicides ~ t + monthdays
Model 2: suicides ~ t + monthdays + cos(2 * pi * t/12) + sin(2 * pi *
t/12)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 201 196.65
2 199 190.38 2 6.2698 0.0435 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So one can reasonably say that this suggests a seasonal effect...
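If you want to see what the fitted seasonal pattern looks like, here is a small sketch; it simply re-uses the two harmonic coefficients of the model a2 fitted above:
## Multiplicative seasonal effect implied by the harmonic terms of a2,
## evaluated over one year (t = 0 corresponds to January).
b <- coef(a2)
seasonal <- exp( b["cos(2 * pi * t/12)"] * cos(2*pi*(0:11)/12) +
                 b["sin(2 * pi * t/12)"] * sin(2*pi*(0:11)/12) )
plot( 0:11, seasonal, type = "b",
      xlab = "month (0 = January)", ylab = "multiplicative seasonal effect" )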
Best Answer
Let's start with some considerations:
One usually begins with simple reasonable models, as suggested by theory and restricted by data limitations, and moves to more complex models only if the simpler ones are inadequate. This is how statistical analysis operationalizes the scientific call for parsimony.
Fitting a trend is a form of regression analysis.
Because you have count data, you would naturally first consider binomial regression or Poisson regression. The former is appropriate in any case, while the latter is an excellent approximation for relatively low rates (which is what one hopes for with infections!) and is widely available in software. (Ordinary least squares (OLS) is a further approximation that would be valid provided all the annual infection counts are fairly large, say in the tens to hundreds or more, and fairly constant over time.)
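For concreteness, with a hypothetical data frame inf holding the yearly counts (cases), the population at risk (at.risk) and the year, these two options would look something like this (a sketch only, not your actual data):
## Binomial regression on the infection proportion ...
fit.bin  <- glm( cbind(cases, at.risk - cases) ~ year,
                 family = binomial, data = inf )
## ... and Poisson regression on the counts, with the log population as an offset.
fit.pois <- glm( cases ~ year + offset(log(at.risk)),
                 family = poisson, data = inf )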
When a longish time series of data is available (usually 20-30+ years), you can consider using time series analysis to help account for correlations in rates from year to year. Usually, though, you would first exhaust plausible regression models to account for nonlinear changes over time, perhaps by including quadratic terms, "level shifts," or (more generally) splines. Note that the flexibility to model changes in slope over time is built into all forms of regression; it is not a special feature of some particular approach.
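As a rough sketch (using the same hypothetical data frame inf), a regression spline is one convenient way to let the trend change shape over time:
library(splines)   # ns() comes with the standard R distribution
## Poisson trend with a natural cubic spline in year (3 degrees of freedom).
fit.spline <- glm( cases ~ ns(year, df = 3) + offset(log(at.risk)),
                   family = poisson, data = inf )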
In any of the regression models you can include separate terms for the male and female trends. This is done by introducing male/female as a covariate by means of "indicator" (or "dummy variable") coding and including its interactions with the trend terms. This has recently been discussed on this site here and here, where you can find the statistical model explicitly stated.
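A minimal sketch of such a model (hypothetical names again, with sex coded as a factor):
## The year:sex interaction lets the trend differ between males and females;
## the main effect of sex allows different baseline levels.
fit.sex <- glm( cases ~ year * sex + offset(log(at.risk)),
                family = poisson, data = inf )
summary(fit.sex)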
In the extreme case where (a) you contemplate the possibility of all regression coefficients differing between the two groups (the intercept, the slope, and the coefficients of any other covariates) and (b) you are using the OLS approximation, this analysis reduces to the Chow Test. The link is to a nice exposition by William Gould, who provides plain-spoken advice ("I blame ... teachers for ... unnecessary jargon") and clear examples. Don't worry that the software is Stata; the output is what matters and it's standard.
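In that OLS setting the comparison amounts to an ordinary F test between the pooled and the fully interacted model (again just a sketch with hypothetical names):
fit.pooled     <- lm( rate ~ year,       data = inf )   # common intercept and slope
fit.interacted <- lm( rate ~ year * sex, data = inf )   # everything may differ by sex
anova( fit.pooled, fit.interacted )   # the Chow-type F test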