Solved – Modelling mortality rates using Poisson regression

multiple regressionpoisson-regressionregressionsurvival

I'm examining trends (between 1998 and 2011) in mortality rates among patients with Crohn's disease. Each patient (case) have been included during 1998 to 2011. At inclusion, each patient have been matched to a healthy control with the same age and sex. I'm analysing trends in mortality rates. When doing this directly, without any adjustments, I obtain fluctuating mortality rates over time, which is presumably due to the fact that individuals included a given year wont be comparable to those included another year. Thus I aim to adjust the mortality rates. I expect that mortality rates in both groups (cases and controls) will decline over time and the gap between cases and controls will narrow successively.

My idea is to do the adjustment by means of Poisson regression. My data is on individual level. I wish to obtain one estimate on the incidence rate (per 1000 person-years) for cases and controls each year from 1998 to 2011. Survival time would be included as the offset in the model. Something similar has been done here.

I've attached the 200 first rows from my data set, which consists of 1500 individuals. Here is the data. Variable explanation:

  • dead = if patient died or not during follow-up
  • surv = survival time in days
  • agegroup = categorized age group (4 groups)
  • gender = male/female
  • diagnosis = 0 for healthy control, 1 for Crohns disease
  • age = age in years
  • inclusion_year = year of inclusion in the study

What did I try so far? I've tried to fit Poisson models with the glm() function in R, using individual observations (log(surv) as the offset), but I either received an error or could not figure out how to use the fits. I've also aggregated the data into groups, and then analysed the death counts in glm(); when I used the fit to obtain incidence rates I could only obtain rates for a specific age/agegroup and sex (as needed to be specified in the predict() function).

I'd really appreciate some statistical advice and coding examples, which can be done on the attached data set.

Best Answer

Without seeing the dataset (not available) it seems mostly correct. The nice thing about Poisson regressions is that they can provide rates when used as suggested. One thing that may be worth to keep in mind is that there may be overdispersion where you should switch to a negative binomial regression (see the MASS package).

The Poisson regression doesn't care whether the data as aggregated or not, but in practice non-aggregated data is frail and can cause some unexpected errors. Note that you cannot have surv == 0 for any of the cases. When I've tested the estimates are the same:

set.seed(1)
n <- 1500
data <- 
  data.frame(
    dead = sample(0:1, n, replace = TRUE, prob = c(.9, .1)),
    surv = ceiling(exp(runif(100))*365),
    gender = sample(c("Male", "Female"), n, replace = TRUE),
    diagnosis = sample(0:1, n, replace = TRUE),
    age = sample(60:80, n, replace = TRUE),
    inclusion_year = sample(1998:2011, n, replace = TRUE)
  )

library(dplyr)
model <- 
  data %>% 
  group_by(gender, 
           diagnosis,
           age,
           inclusion_year) %>% 
  summarise(Deaths = sum(dead),
            Person_time = sum(surv)) %>%
  glm(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) + offset(log(Person_time/10^3/365.25)), 
      data = . , family = poisson)

alt_model <- glm(dead ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) + offset(log(surv/10^3/365.25)), 
    data = data , family = poisson)
sum(coef(alt_model) - coef(model))
# > 1.779132e-14
sum(abs(confint(alt_model) - confint(model)))
# > 6.013114e-11

As you get a rate it is important to center the variables so that the intercept is interpretable, e.g.:

> exp(coef(model)["(Intercept)"])
(Intercept) 
    51.3771

Can be interpreted as the base rate and then the covariates are rate ratios. If we want the base rate after 10 years:

> exp(coef(model)["(Intercept)"] + coef(model)["I(inclusion_year - 1998)"]*10)
(Intercept) 
     47.427 

I've currently modeled the inclusion year as a trend variable but you should probably check for nonlinearities and sometimes it is useful to do a categorization of the time points. I used this approach in this article:

D. Gordon, P. Gillgren, S. Eloranta, H. Olsson, M. Gordon, J. Hansson, and K. E. Smedby, “Time trends in incidence of cutaneous melanoma by detailed anatomical location and patterns of ultraviolet radiation exposure: a retrospective population-based study,” Melanoma Res., vol. 25, no. 4, pp. 348–356, Aug. 2015.

Related Question