Poisson Model – How to Calculate Incidence Rates and Relate to Hazard Ratio from Cox PH Model

cox-modelincidence-rate-ratiopoisson-regressionrsurvival

I want to calculate incidence rates to present along hazard ratio's in order to present both relative and absolute measures of risk. I saw in other studies that such incidence rates can be calculated using poisson models with follow-up time in the model as an offset. So I tried that in R as follows :

library(survival)

# Get example data
data(colon)
colon$status <- ifelse(colon$etype==1,0,1) # set to 0/1 (needed for poisson later on)

# Fit cox model for rx (age + sex adjusted)
coxph(Surv(time,status)~rx+sex+age, data=colon)
# HR (rxLev): 0.92  
# HR (rxLev+5FU): 0.74

# Get incidence rates using poisson models with same terms and log(time) as offset
mod <- glm(status~offset(log(time))+rx+sex+age, data=colon, family=poisson)

# Get rates using predict-function
Obs <- predict(mod, data.frame(time=1, rx="Obs", age=mean(colon$age),
                                   sex=mean(colon$sex)),  type="response")
Lev <- predict(mod, data.frame(time=1, rx="Lev", age=mean(colon$age), 
                                   sex=mean(colon$sex)),  type="response")
Lev5FU <- predict(mod, data.frame(time=1, rx="Lev+5FU", age=mean(colon$age), 
                                      sex=mean(colon$sex)),  type="response")

# Calculate incidence rate ratio's:
Lev/Obs # 0.98
Lev5FU/Obs # 0.84

I would expect that the incidence rate ratio's are similar to the hazard ratio's from the Cox PH model with the same terms, but somehow they differ. Am I using the correct approach to calculate incidence rates?

Any help would be greatly appreciated!

Best Answer

As far as I can see there's nothing wrong with your code or calculations. You could skip a few lines of code, though, by getting the incidence rate ratios by ${\tt exp(coef(mod))}$. The two models make different assumptions, and this potentially leads to different results.

Poisson regression assumes constant hazards. The Cox model only assumes that the hazards are proportional. If the assumption of constant hazards is fulfilled this question

Does Cox Regression have an underlying Poisson distribution?

explains the connection between Cox and Poisson regression.

We can use simulation to study two situations: constant hazards and non-constant (but proportional) hazards. First let's simulate data from a population with a constant hazard. The hazard ratio has the form

$\lambda(t) = \lambda_0\exp(\beta'x)$,

where $\beta$ is a vector of parameters, $x$ is a vector of covariates and $\lambda_0$ is some fixed positive number. Thus, the constant hazard assumption of the Poisson regression is fulfilled. Now we simulate from this model by using the fact (found in many books, e.g. Modeling Survival Data by Therneau, p.13) that the distribution function, $F$, of the survival time with $\lambda$ as hazard can be found as

$F(t) = 1 - \exp\left(\int_0^t \lambda(s)\text{ d}s\right)$.

With this we can also find the inverse of $F$, $F^{-1}$. With this function we simulate survival times with the correct hazard by drawing variables that are uniform on $(0,1)$ and transforming them using $F^{-1}$. Let's do it.

library(survival)
data(colon)
data <- with(colon, data.frame(sex = sex, rx = rx, age = age))
n <- dim(data)[1]
# defining linP, the linear predictor, beta*x in the above notation
linP <- with(colon, log(0.05) + c(0.05, 0.01)[as.factor(sex)] + c(0.01,0.05,0.1)[rx] + 0.1*age)

h <- exp(linP)

simFuncC <- function() {
  cens <- runif(n) # simulating censoring times
  toe <- -log(runif(n))/h # simulating times of events
  event <- ifelse(toe <= cens, 1, 0) # deciding if time of event or censoring is the smallest
  data$time <- pmin(toe, cens)
  data$event <- event
  mCox <- coxph(Surv(time, event) ~ sex + rx + age, data = data)
  mPois <- glm(event ~ sex + rx + age, data = data, offset = log(time))
  c(coef(mCox), coef(mPois))
}

sim <- t(replicate(1000, simFuncC()))
colMeans(sim)

For the Cox model the averages of the parameter estimates are

        sex       rxLev   rxLev+5FU         age 
-0.03826301  0.04167353  0.09069553  0.10025534 

and for the Poisson model

(Intercept)         sex       rxLev   rxLev+5FU         age 
-1.23651275 -0.03822161  0.03678366  0.08606452  0.09812454 

For both models, we see that this is close to the true values, remembering that the difference between men and women was -0.04, for instance, and it's estimated to be -0.038 for both models. We can now do the same with the non-constant hazard function

$\lambda(t) = \lambda_0 t \exp(\beta'x)$.

We now simulate as before.

simFuncN <- function() {
  cens <- runif(n)
  toe <- sqrt(-log(runif(n))/h)
  event <- ifelse(toe <= cens, 1, 0)
  data$time <- pmin(toe, cens)
  data$event <- event
  mCox <- coxph(Surv(time, event) ~ sex + rx + age, data = data)
  mPois <- glm(event ~ sex + rx + age, data = data, offset = log(time))
  c(coef(mCox), coef(mPois))
}

sim <- t(replicate(1000, simFuncN()))
colMeans(sim)

For the Cox model we now get

        sex       rxLev   rxLev+5FU         age 
-0.04220381  0.04497241  0.09163522  0.10029121  

and for the Poisson model

(Intercept)         sex       rxLev   rxLev+5FU         age 
-0.12001361 -0.01937333  0.02028097  0.04318946  0.04908300

In this simulation, the averages of the Poisson model is clearly further from the true values than those of the Cox model. This is not surprising as we have violated the assumption of constant hazards.

When the hazard is constant, the survivor function, $S$, is of the form

$S(t) = \exp(-\alpha * t) $,

for some positive $\alpha$ dependent on the specific subject, thus $S$ is convex. If we use the Kaplan-Meier estimator to get an estimate of $S$ for the original data, we see the following.

enter image description here

This function looks concave. This doesn't prove anything, but it could be a hint that the assumption of constant hazards is not fulfilled for this data set, which in turn could explain the discrepancies between the two models.

A final remark on data, as far as I know ${\tt colon}$ holds data on both time to recurrence of cancer and time to death (there's two observations for each value of ${\tt id}$). In the above, we've been modeling it like it was just the same thing. That's probably not a good idea.

Related Question