What about a Poisson regression?
I created a data frame containing your data, plus an index t
for the time (in months) and a variable monthdays
for the number of days in each month.
T <- read.table("suicide.txt", header = TRUE)   # one row per year, one column per month
U <- data.frame( year     = as.numeric(rep(rownames(T), each = 12)),
                 month    = rep(colnames(T), nrow(T)),
                 t        = seq(0, length.out = nrow(T) * ncol(T)),   # month index, starting at 0
                 suicides = as.vector(t(T)) )
U$monthdays <- c(31,28,31,30,31,30,31,31,30,31,30,31)        # recycled over the years
U$monthdays[ !(U$year %% 4) & U$month == "Feb" ] <- 29        # leap-year Februaries
So it looks like this:
> head(U,14)
year month t suicides monthdays
1 1995 Jan 0 62 31
2 1995 Feb 1 47 28
3 1995 Mar 2 55 31
4 1995 Apr 3 74 30
5 1995 May 4 71 31
6 1995 Jun 5 70 30
7 1995 Jul 6 67 31
8 1995 Aug 7 69 31
9 1995 Sep 8 61 30
10 1995 Oct 9 76 31
11 1995 Nov 10 68 30
12 1995 Dec 11 68 31
13 1996 Jan 12 64 31
14 1996 Feb 13 69 29
Now let’s compare a model with a time effect and a number-of-days effect against a model in which we also add a month effect:
> a0 <- glm( suicides ~ t + monthdays, family="poisson", data = U )
> a1 <- glm( suicides ~ t + monthdays + month, family="poisson", data = U )
Here is the summary of the "small" model:
> summary(a0)
Call:
glm(formula = suicides ~ t + monthdays, family = "poisson", data = U)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7163 -0.6865 -0.1161 0.6363 3.2104
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.8060135 0.3259116 8.610 < 2e-16 ***
t 0.0013650 0.0001443 9.461 < 2e-16 ***
monthdays 0.0418509 0.0106874 3.916 9.01e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 302.67 on 203 degrees of freedom
Residual deviance: 196.64 on 201 degrees of freedom
AIC: 1437.2
Number of Fisher Scoring iterations: 4
You can see that both variables have highly significant marginal effects. Now look at the larger model:
> summary(a1)
Call:
glm(formula = suicides ~ t + monthdays + month, family = "poisson",
data = U)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.56164 -0.72363 -0.05581 0.58897 3.09423
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4559200 2.1586699 0.674 0.500
t 0.0013810 0.0001446 9.550 <2e-16 ***
monthdays 0.0869293 0.0719304 1.209 0.227
monthAug -0.0845759 0.0832327 -1.016 0.310
monthDec -0.1094669 0.0833577 -1.313 0.189
monthFeb 0.0657800 0.1331944 0.494 0.621
monthJan -0.0372652 0.0830087 -0.449 0.653
monthJul -0.0125179 0.0828694 -0.151 0.880
monthJun 0.0452746 0.0414287 1.093 0.274
monthMar -0.0638177 0.0831378 -0.768 0.443
monthMay -0.0146418 0.0828840 -0.177 0.860
monthNov -0.0381897 0.0422365 -0.904 0.366
monthOct -0.0463416 0.0830329 -0.558 0.577
monthSep 0.0070567 0.0417829 0.169 0.866
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 302.67 on 203 degrees of freedom
Residual deviance: 182.72 on 190 degrees of freedom
AIC: 1445.3
Number of Fisher Scoring iterations: 4
Well, of course the monthdays effect vanishes: once month is in the model, the number of days is determined by the month except in leap-year Februaries, so this effect can only be estimated thanks to leap years! Keeping it in the model (and taking leap years into account) allows us to use the residual deviances to compare the two models.
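You can check this near-collinearity directly; a quick cross-tabulation of the data frame U built above shows that, February aside, each month always has the same number of days:
## Apart from February in leap years, every month has a single possible value
## of monthdays, so once "month" is in the model, "monthdays" adds almost no
## additional information.
with(U, table(month, monthdays))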
> anova(a0, a1, test="Chisq")
Analysis of Deviance Table
Model 1: suicides ~ t + monthdays
Model 2: suicides ~ t + monthdays + month
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 201 196.65
2 190 182.72 11 13.928 0.237
So, no (significant) month effect? But what about a seasonal effect? We can try to capture seasonality using two variables $\cos\left( {2\pi t \over 12}\right)$ and $\sin\left( {2\pi t \over 12}\right)$:
> a2 <- glm( suicides ~ t + monthdays + cos(2*pi*t/12) + sin(2*pi*t/12),
family="poisson", data = U )
> summary(a2)
Call:
glm(formula = suicides ~ t + monthdays + cos(2 * pi * t/12) +
sin(2 * pi * t/12), family = "poisson", data = U)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4782 -0.7095 -0.0544 0.6471 3.2236
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.8676170 0.3381954 8.479 < 2e-16 ***
t 0.0013711 0.0001444 9.493 < 2e-16 ***
monthdays 0.0397990 0.0110877 3.589 0.000331 ***
cos(2 * pi * t/12) -0.0245589 0.0122658 -2.002 0.045261 *
sin(2 * pi * t/12) 0.0172967 0.0121591 1.423 0.154874
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 302.67 on 203 degrees of freedom
Residual deviance: 190.37 on 199 degrees of freedom
AIC: 1434.9
Number of Fisher Scoring iterations: 4
Now compare it with the smaller model a0:
> anova(a0, a2, test="Chisq")
Analysis of Deviance Table
Model 1: suicides ~ t + monthdays
Model 2: suicides ~ t + monthdays + cos(2 * pi * t/12) + sin(2 * pi *
t/12)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 201 196.65
2 199 190.38 2 6.2698 0.0435 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So one can reasonably say that this suggests a seasonal effect...
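If you want to see what the fitted seasonal pattern looks like, here is a small sketch; it simply re-uses the two harmonic coefficients of the model a2 fitted above:
## Multiplicative seasonal effect implied by the harmonic terms of a2,
## evaluated over one year (t = 0 corresponds to January).
b <- coef(a2)
seasonal <- exp( b["cos(2 * pi * t/12)"] * cos(2*pi*(0:11)/12) +
                 b["sin(2 * pi * t/12)"] * sin(2*pi*(0:11)/12) )
plot( 0:11, seasonal, type = "b",
      xlab = "month (0 = January)", ylab = "multiplicative seasonal effect" )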
Best Answer
Let's start with some considerations:
One usually begins with simple reasonable models, as suggested by theory and restricted by data limitations, and moves to more complex models only if the simpler ones are inadequate. This is how statistical analysis operationalizes the scientific call for parsimony.
Fitting a trend is a form of regression analysis.
Because you have count data, you would naturally first consider binomial regression or Poisson regression. The former is appropriate in any case, while the latter is an excellent approximation for relatively low rates (which is what one hopes for with infections!) and is widely available in software. (Ordinary least squares (OLS) is a further approximation that would be valid provided all the annual infection counts are fairly large, say in the tens to hundreds or more, and fairly constant over time.)
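For concreteness, with a hypothetical data frame inf holding the yearly counts (cases), the population at risk (at.risk) and the year, these two options would look something like this (a sketch only, not your actual data):
## Binomial regression on the infection proportion ...
fit.bin  <- glm( cbind(cases, at.risk - cases) ~ year,
                 family = binomial, data = inf )
## ... and Poisson regression on the counts, with the log population as an offset.
fit.pois <- glm( cases ~ year + offset(log(at.risk)),
                 family = poisson, data = inf )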
When a longish time series of data is available (usually 20-30+ years), you can consider using time series analysis to help account for correlations in rates from year to year. Usually, though, you would first exhaust plausible regression models to account for nonlinear changes over time, perhaps by including quadratic terms, "level shifts," or (more generally) splines. Note that the flexibility to model changes in slope over time is built into all forms of regression; it is not a special feature of some particular approach.
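As a rough sketch (using the same hypothetical data frame inf), a regression spline is one convenient way to let the trend change shape over time:
library(splines)   # ns() comes with the standard R distribution
## Poisson trend with a natural cubic spline in year (3 degrees of freedom).
fit.spline <- glm( cases ~ ns(year, df = 3) + offset(log(at.risk)),
                   family = poisson, data = inf )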
In any of the regression models you can include separate terms for the male and female trends. This is done by introducing male/female as a covariate by means of "indicator" (or "dummy variable") coding and including its interactions with the trend terms. This has recently been discussed on this site here and here, where you can find the statistical model explicitly stated.
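A minimal sketch of such a model (hypothetical names again, with sex coded as a factor):
## The year:sex interaction lets the trend differ between males and females;
## the main effect of sex allows different baseline levels.
fit.sex <- glm( cases ~ year * sex + offset(log(at.risk)),
                family = poisson, data = inf )
summary(fit.sex)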
In the extreme case where (a) you contemplate the possibility of all regression coefficients differing between the two groups (the intercept, the slope, and the coefficients of any other covariates) and (b) you are using the OLS approximation, this analysis reduces to the Chow Test. The link is to a nice exposition by William Gould, who provides plain-spoken advice ("I blame ... teachers for ... unnecessary jargon") and clear examples. Don't worry that the software is Stata; the output is what matters and it's standard.
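In that OLS setting the comparison amounts to an ordinary F test between the pooled and the fully interacted model (again just a sketch with hypothetical names):
fit.pooled     <- lm( rate ~ year,       data = inf )   # common intercept and slope
fit.interacted <- lm( rate ~ year * sex, data = inf )   # everything may differ by sex
anova( fit.pooled, fit.interacted )   # the Chow-type F test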