A couple of suggestions, not directly related to Cox PH but to interactions and collinearity in general:
1) When you are getting "crazy" values like these, one possibility is collinearity. This is often a problem when you have interactions. Have you centered all your variables (by subtracting the mean from each)?
2) You can't interpret one interaction among many quite so easily. LT, food and temp2 are all involved in many interactions, so look at predicted values for different combinations of the covariates.
3) Check the units of the different variables. When you get crazy parameters, it is sometimes a problem of units (e.g. measuring a human height in millimeters or kilometers).
4) Once you've got that stuff straightened out, I find the easiest way to think about the effects of different interactions (especially higher-order ones) is to graph the predicted values for different combinations of the independent variables, as sketched below.
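Since the original data aren't shown, here is only a rough sketch of points 1) and 4); the data frame ${\tt dat}$, the variable names and the ${\tt Surv(time, status)}$ outcome are all made up for illustration:
library(survival)
library(lattice)

## stand-in data just to make the sketch runnable
set.seed(42)
dat <- data.frame(time = rexp(200), status = rbinom(200, 1, 0.7),
                  LT = rnorm(200, 10), food = rnorm(200, 5), temp2 = rnorm(200, 20))

## 1) center the continuous covariates (subtract the mean only)
dat$LT_c    <- as.numeric(scale(dat$LT,    scale = FALSE))
dat$food_c  <- as.numeric(scale(dat$food,  scale = FALSE))
dat$temp2_c <- as.numeric(scale(dat$temp2, scale = FALSE))

fit <- coxph(Surv(time, status) ~ LT_c * food_c * temp2_c, data = dat)

## 4) predicted linear predictors over a grid of covariate combinations
newdat <- expand.grid(
  LT_c    = quantile(dat$LT_c,   c(.1, .5, .9)),
  food_c  = quantile(dat$food_c, c(.1, .5, .9)),
  temp2_c = seq(min(dat$temp2_c), max(dat$temp2_c), length.out = 50)
)
newdat$lp <- predict(fit, newdata = newdat, type = "lp")

## one panel per LT/food combination, linear predictor against temp2
xyplot(lp ~ temp2_c | factor(round(LT_c, 1)) * factor(round(food_c, 1)),
       data = newdat, type = "l")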
Without seeing the dataset (it isn't available) this seems mostly correct. The nice thing about Poisson regression is that it provides rates when used as suggested. One thing worth keeping in mind is that there may be overdispersion, in which case you should switch to a negative binomial regression (see the MASS package; a check is sketched after the code below).
Poisson regression doesn't care whether the data are aggregated or not, but in practice non-aggregated data are more fragile and can cause some unexpected errors. Note that you cannot have surv == 0
for any of the cases. When I've tested it, the estimates are the same:
set.seed(1)
n <- 1500
data <-
  data.frame(
    dead = sample(0:1, n, replace = TRUE, prob = c(.9, .1)),  # ~10 % deaths
    surv = ceiling(exp(runif(100)) * 365),                    # follow-up in days (length 100, recycled over the n rows)
    gender = sample(c("Male", "Female"), n, replace = TRUE),
    diagnosis = sample(0:1, n, replace = TRUE),
    age = sample(60:80, n, replace = TRUE),
    inclusion_year = sample(1998:2011, n, replace = TRUE)
  )

library(dplyr)

# Poisson model on the aggregated data: deaths per 1000 person-years
model <-
  data %>%
  group_by(gender, diagnosis, age, inclusion_year) %>%
  summarise(Deaths = sum(dead),
            Person_time = sum(surv)) %>%
  glm(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) +
        offset(log(Person_time / 10^3 / 365.25)),
      data = ., family = poisson)

# The same model on the individual-level (non-aggregated) data
alt_model <- glm(dead ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) +
                   offset(log(surv / 10^3 / 365.25)),
                 data = data, family = poisson)

# The estimates agree to numerical precision
sum(coef(alt_model) - coef(model))
# > 1.779132e-14
sum(abs(confint(alt_model) - confint(model)))
# > 6.013114e-11
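If the counts turn out to be overdispersed, the same linear predictor can be refitted as a negative binomial model. A rough sketch (the aggregation is repeated here so that the grouped data are available as a plain data frame; the deviance/df ratio is only a crude indication):
library(MASS)

agg <- data %>%
  group_by(gender, diagnosis, age, inclusion_year) %>%
  summarise(Deaths = sum(dead),
            Person_time = sum(surv),
            .groups = "drop")

## crude check: residual deviance far above its degrees of freedom suggests overdispersion
deviance(model) / df.residual(model)

## negative binomial fit with the same linear predictor and offset
nb_model <- glm.nb(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) +
                     offset(log(Person_time / 10^3 / 365.25)),
                   data = agg)
exp(coef(nb_model))  # rate ratios, on the same scale as the Poisson fit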
Since you get a rate, it is important to center the variables so that the intercept is interpretable, e.g.:
> exp(coef(model)["(Intercept)"])
(Intercept)
51.3771
This can be interpreted as the baseline rate (deaths per 1000 person-years for the reference covariate pattern), and the exponentiated covariate coefficients are then rate ratios. If we want the baseline rate after 10 years:
> exp(coef(model)["(Intercept)"] + coef(model)["I(inclusion_year - 1998)"]*10)
(Intercept)
47.427
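A compact way to get all the rate ratios together with their confidence intervals from the fit above (profile-likelihood intervals, so this takes a moment):
exp(cbind(IRR = coef(model), confint(model)))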
I've modeled the inclusion year as a linear trend here, but you should probably check for nonlinearity, and sometimes it is useful to categorize the time points instead (a sketch follows the reference below). I used this approach in this article:
D. Gordon, P. Gillgren, S. Eloranta, H. Olsson, M. Gordon, J. Hansson, and K. E. Smedby, “Time trends in incidence of cutaneous melanoma by detailed anatomical location and patterns of ultraviolet radiation exposure: a retrospective population-based study,” Melanoma Res., vol. 25, no. 4, pp. 348–356, Aug. 2015.
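For the categorization (or a spline as an alternative), a sketch building on the models above; the cut points are arbitrary and the spline comparison is only a rough check for nonlinearity:
## calendar periods instead of a linear trend
cat_model <- glm(dead ~ gender + diagnosis + I(age - 70) +
                   cut(inclusion_year, breaks = c(1997, 2002, 2006, 2011)) +
                   offset(log(surv / 10^3 / 365.25)),
                 data = data, family = poisson)

## or a flexible spline for the year trend
library(splines)
ns_model <- glm(dead ~ gender + diagnosis + I(age - 70) +
                  ns(inclusion_year, df = 3) +
                  offset(log(surv / 10^3 / 365.25)),
                data = data, family = poisson)
anova(alt_model, ns_model, test = "Chisq")  # rough test of the linear-trend assumption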
Best Answer
As far as I can see there's nothing wrong with your code or calculations. You could, however, skip a few lines of code by getting the incidence rate ratios directly with ${\tt exp(coef(mod))}$. The two models make different assumptions, and this potentially leads to different results.
Poisson regression assumes constant hazards, while the Cox model only assumes that the hazards are proportional. If the assumption of constant hazards is fulfilled, the two should give similar results; the question
Does Cox Regression have an underlying Poisson distribution?
explains the connection between Cox and Poisson regression.
We can use simulation to study two situations: constant hazards and non-constant (but proportional) hazards. First, let's simulate data from a population with a constant hazard. The hazard function has the form
$\lambda(t) = \lambda_0\exp(\beta'x)$,
where $\beta$ is a vector of parameters, $x$ is a vector of covariates and $\lambda_0$ is some fixed positive number. Thus, the constant hazard assumption of the Poisson regression is fulfilled. Now we simulate from this model by using the fact (found in many books, e.g. Modeling Survival Data by Therneau, p.13) that the distribution function, $F$, of the survival time with $\lambda$ as hazard can be found as
$F(t) = 1 - \exp\left(-\int_0^t \lambda(s)\,\text{d}s\right)$.
From this we can also find the inverse of $F$, $F^{-1}$, and we can then simulate survival times with the correct hazard by drawing variables that are uniform on $(0,1)$ and transforming them using $F^{-1}$. Let's do it.
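The original simulation code isn't shown, so here is a minimal sketch of how it could look. With the constant hazard, $F^{-1}(u) = -\log(1-u)/(\lambda_0 e^{\beta'x})$; the value $\lambda_0 = 0.01$, the sample size, the censoring scheme and the number of replicates are made up, while the sex effect of $-0.04$ matches the true value quoted below:
library(survival)

simulate_constant <- function(n = 1000, lambda0 = 0.01, beta_sex = -0.04) {
  sex  <- rbinom(n, 1, 0.5)                        # 1 = male, 0 = female (assumed coding)
  rate <- lambda0 * exp(beta_sex * sex)
  time <- -log(1 - runif(n)) / rate                # inverse-transform sampling, F^{-1}(U)
  cens <- runif(n, 0, quantile(time, 0.9))         # some administrative censoring
  data.frame(time = pmin(time, cens), status = as.numeric(time <= cens), sex = sex)
}

set.seed(2)
res <- replicate(500, {
  d <- simulate_constant()
  c(cox     = coef(coxph(Surv(time, status) ~ sex, data = d)),
    poisson = coef(glm(status ~ sex + offset(log(time)),
                       family = poisson, data = d))["sex"])
})
rowMeans(res)  # average estimate of the sex effect for each model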
For both the Cox model and the Poisson model, the averages of the parameter estimates are close to the true values; recall, for instance, that the true difference between men and women was -0.04, and it is estimated to be -0.038 by both models. We can now do the same with the non-constant hazard function
$\lambda(t) = \lambda_0 t \exp(\beta'x)$.
We now simulate as before.
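A corresponding sketch for this hazard: the cumulative hazard is $\lambda_0 \exp(\beta'x)\,t^2/2$, so $F^{-1}(u) = \sqrt{-2\log(1-u)/(\lambda_0 e^{\beta'x})}$ (again with made-up $\lambda_0$, censoring and replicate numbers):
simulate_increasing <- function(n = 1000, lambda0 = 0.01, beta_sex = -0.04) {
  sex   <- rbinom(n, 1, 0.5)
  scale <- lambda0 * exp(beta_sex * sex)
  time  <- sqrt(-2 * log(1 - runif(n)) / scale)    # inverse-transform with the new F
  cens  <- runif(n, 0, quantile(time, 0.9))
  data.frame(time = pmin(time, cens), status = as.numeric(time <= cens), sex = sex)
}

set.seed(3)
res2 <- replicate(500, {
  d <- simulate_increasing()
  c(cox     = coef(coxph(Surv(time, status) ~ sex, data = d)),
    poisson = coef(glm(status ~ sex + offset(log(time)),
                       family = poisson, data = d))["sex"])
})
rowMeans(res2)  # average estimate of the sex effect for each model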
In this simulation, the averages of the Poisson-model estimates are clearly further from the true values than those of the Cox model. This is not surprising, as we have violated the assumption of constant hazards.
When the hazard is constant, the survivor function, $S$, is of the form
$S(t) = \exp(-\alpha t)$,
for some positive $\alpha$ depending on the specific subject; thus $S$ is convex. If we use the Kaplan-Meier estimator to get an estimate of $S$ for the original data, we see the following.
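A sketch of how that estimate can be obtained, assuming the ${\tt colon}$ data that ship with the survival package and pooling all rows as in the analysis above:
library(survival)
km <- survfit(Surv(time, status) ~ 1, data = colon)
plot(km, xlab = "Days", ylab = "Estimated survivor function S(t)")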
This function looks concave. This doesn't prove anything, but it could be a hint that the assumption of constant hazards is not fulfilled for this data set, which in turn could explain the discrepancies between the two models.
A final remark on the data: as far as I know, ${\tt colon}$ holds data on both time to recurrence of the cancer and time to death (there are two observations for each value of ${\tt id}$). In the above, we've been modeling them as if they were the same thing. That's probably not a good idea.