A couple of suggestions, not directly related to Cox PH but to interactions and collinearity in general:
1) When you are getting "crazy" values like these, one possibility is collinearity. This is often a problem when you have interactions. Have you centered all your variables (by subtracting the mean from each)?
2) You can't interpret one interaction among many quite so easily. LT, food and temp2 are all involved in many interactions, so look at predicted values for different combinations of the covariates.
3) Check the units of the different variables. When you get crazy parameters, it is sometimes a problem of units (e.g. measuring a human height in millimeters or kilometers).
4) Once you've got that stuff straightened out, I find the easiest way to think about the effects of different interactions (especially higher-order ones) is to graph the predicted values for different combinations of the independent variables, as sketched below.
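Since the original data aren't shown, here is only a rough sketch of points 1) and 4); the data frame ${\tt dat}$, the variable names and the ${\tt Surv(time, status)}$ outcome are all made up for illustration:
library(survival)
library(lattice)

## stand-in data just to make the sketch runnable
set.seed(42)
dat <- data.frame(time = rexp(200), status = rbinom(200, 1, 0.7),
                  LT = rnorm(200, 10), food = rnorm(200, 5), temp2 = rnorm(200, 20))

## 1) center the continuous covariates (subtract the mean only)
dat$LT_c    <- as.numeric(scale(dat$LT,    scale = FALSE))
dat$food_c  <- as.numeric(scale(dat$food,  scale = FALSE))
dat$temp2_c <- as.numeric(scale(dat$temp2, scale = FALSE))

fit <- coxph(Surv(time, status) ~ LT_c * food_c * temp2_c, data = dat)

## 4) predicted linear predictors over a grid of covariate combinations
newdat <- expand.grid(
  LT_c    = quantile(dat$LT_c,   c(.1, .5, .9)),
  food_c  = quantile(dat$food_c, c(.1, .5, .9)),
  temp2_c = seq(min(dat$temp2_c), max(dat$temp2_c), length.out = 50)
)
newdat$lp <- predict(fit, newdata = newdat, type = "lp")

## one panel per LT/food combination, linear predictor against temp2
xyplot(lp ~ temp2_c | factor(round(LT_c, 1)) * factor(round(food_c, 1)),
       data = newdat, type = "l")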
Without seeing the dataset (it isn't available) this seems mostly correct. The nice thing about Poisson regression is that it provides rates when used as suggested. One thing worth keeping in mind is that there may be overdispersion, in which case you should switch to a negative binomial regression (see the MASS package; a check is sketched after the code below).
Poisson regression doesn't care whether the data are aggregated or not, but in practice non-aggregated data are more fragile and can cause some unexpected errors. Note that you cannot have surv == 0
for any of the cases. When I've tested it, the estimates are the same:
set.seed(1)
n <- 1500
data <-
  data.frame(
    dead = sample(0:1, n, replace = TRUE, prob = c(.9, .1)),  # ~10 % deaths
    surv = ceiling(exp(runif(100)) * 365),                    # follow-up in days (length 100, recycled over the n rows)
    gender = sample(c("Male", "Female"), n, replace = TRUE),
    diagnosis = sample(0:1, n, replace = TRUE),
    age = sample(60:80, n, replace = TRUE),
    inclusion_year = sample(1998:2011, n, replace = TRUE)
  )

library(dplyr)

# Poisson model on the aggregated data: deaths per 1000 person-years
model <-
  data %>%
  group_by(gender, diagnosis, age, inclusion_year) %>%
  summarise(Deaths = sum(dead),
            Person_time = sum(surv)) %>%
  glm(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) +
        offset(log(Person_time / 10^3 / 365.25)),
      data = ., family = poisson)

# The same model on the individual-level (non-aggregated) data
alt_model <- glm(dead ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) +
                   offset(log(surv / 10^3 / 365.25)),
                 data = data, family = poisson)

# The estimates agree to numerical precision
sum(coef(alt_model) - coef(model))
# > 1.779132e-14
sum(abs(confint(alt_model) - confint(model)))
# > 6.013114e-11
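If the counts turn out to be overdispersed, the same linear predictor can be refitted as a negative binomial model. A rough sketch (the aggregation is repeated here so that the grouped data are available as a plain data frame; the deviance/df ratio is only a crude indication):
library(MASS)

agg <- data %>%
  group_by(gender, diagnosis, age, inclusion_year) %>%
  summarise(Deaths = sum(dead),
            Person_time = sum(surv),
            .groups = "drop")

## crude check: residual deviance far above its degrees of freedom suggests overdispersion
deviance(model) / df.residual(model)

## negative binomial fit with the same linear predictor and offset
nb_model <- glm.nb(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) +
                     offset(log(Person_time / 10^3 / 365.25)),
                   data = agg)
exp(coef(nb_model))  # rate ratios, on the same scale as the Poisson fit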
Since you get a rate, it is important to center the variables so that the intercept is interpretable, e.g.:
> exp(coef(model)["(Intercept)"])
(Intercept)
51.3771
This can be interpreted as the baseline rate (deaths per 1000 person-years for the reference covariate pattern), and the exponentiated covariate coefficients are then rate ratios. If we want the baseline rate after 10 years:
> exp(coef(model)["(Intercept)"] + coef(model)["I(inclusion_year - 1998)"]*10)
(Intercept)
47.427
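A compact way to get all the rate ratios together with their confidence intervals from the fit above (profile-likelihood intervals, so this takes a moment):
exp(cbind(IRR = coef(model), confint(model)))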
I've modeled the inclusion year as a linear trend here, but you should probably check for nonlinearity, and sometimes it is useful to categorize the time points instead (a sketch follows the reference below). I used this approach in this article:
D. Gordon, P. Gillgren, S. Eloranta, H. Olsson, M. Gordon, J. Hansson, and K. E. Smedby, “Time trends in incidence of cutaneous melanoma by detailed anatomical location and patterns of ultraviolet radiation exposure: a retrospective population-based study,” Melanoma Res., vol. 25, no. 4, pp. 348–356, Aug. 2015.
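For the categorization (or a spline as an alternative), a sketch building on the models above; the cut points are arbitrary and the spline comparison is only a rough check for nonlinearity:
## calendar periods instead of a linear trend
cat_model <- glm(dead ~ gender + diagnosis + I(age - 70) +
                   cut(inclusion_year, breaks = c(1997, 2002, 2006, 2011)) +
                   offset(log(surv / 10^3 / 365.25)),
                 data = data, family = poisson)

## or a flexible spline for the year trend
library(splines)
ns_model <- glm(dead ~ gender + diagnosis + I(age - 70) +
                  ns(inclusion_year, df = 3) +
                  offset(log(surv / 10^3 / 365.25)),
                data = data, family = poisson)
anova(alt_model, ns_model, test = "Chisq")  # rough test of the linear-trend assumption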
Best Answer
As far as I can see there's nothing wrong with your code or calculations. You could, however, skip a few lines of code by getting the incidence rate ratios directly with ${\tt exp(coef(mod))}$. The two models make different assumptions, and this potentially leads to different results.
Poisson regression assumes constant hazards, while the Cox model only assumes that the hazards are proportional. If the assumption of constant hazards is fulfilled, the two should give similar results; the question
Does Cox Regression have an underlying Poisson distribution?
explains the connection between Cox and Poisson regression.
We can use simulation to study two situations: constant hazards and non-constant (but proportional) hazards. First, let's simulate data from a population with a constant hazard. The hazard function has the form
$\lambda(t) = \lambda_0\exp(\beta'x)$,
where $\beta$ is a vector of parameters, $x$ is a vector of covariates and $\lambda_0$ is some fixed positive number. Thus, the constant hazard assumption of the Poisson regression is fulfilled. Now we simulate from this model by using the fact (found in many books, e.g. Modeling Survival Data by Therneau, p.13) that the distribution function, $F$, of the survival time with $\lambda$ as hazard can be found as
$F(t) = 1 - \exp\left(-\int_0^t \lambda(s)\,\text{d}s\right)$.
From this we can also find the inverse of $F$, $F^{-1}$, and we can then simulate survival times with the correct hazard by drawing variables that are uniform on $(0,1)$ and transforming them using $F^{-1}$. Let's do it.
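The original simulation code isn't shown, so here is a minimal sketch of how it could look. With the constant hazard, $F^{-1}(u) = -\log(1-u)/(\lambda_0 e^{\beta'x})$; the value $\lambda_0 = 0.01$, the sample size, the censoring scheme and the number of replicates are made up, while the sex effect of $-0.04$ matches the true value quoted below:
library(survival)

simulate_constant <- function(n = 1000, lambda0 = 0.01, beta_sex = -0.04) {
  sex  <- rbinom(n, 1, 0.5)                        # 1 = male, 0 = female (assumed coding)
  rate <- lambda0 * exp(beta_sex * sex)
  time <- -log(1 - runif(n)) / rate                # inverse-transform sampling, F^{-1}(U)
  cens <- runif(n, 0, quantile(time, 0.9))         # some administrative censoring
  data.frame(time = pmin(time, cens), status = as.numeric(time <= cens), sex = sex)
}

set.seed(2)
res <- replicate(500, {
  d <- simulate_constant()
  c(cox     = coef(coxph(Surv(time, status) ~ sex, data = d)),
    poisson = coef(glm(status ~ sex + offset(log(time)),
                       family = poisson, data = d))["sex"])
})
rowMeans(res)  # average estimate of the sex effect for each model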
For both the Cox model and the Poisson model, the averages of the parameter estimates are close to the true values; recall, for instance, that the true difference between men and women was -0.04, and it is estimated to be -0.038 by both models. We can now do the same with the non-constant hazard function
$\lambda(t) = \lambda_0 t \exp(\beta'x)$.
We now simulate as before.
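A corresponding sketch for this hazard: the cumulative hazard is $\lambda_0 \exp(\beta'x)\,t^2/2$, so $F^{-1}(u) = \sqrt{-2\log(1-u)/(\lambda_0 e^{\beta'x})}$ (again with made-up $\lambda_0$, censoring and replicate numbers):
simulate_increasing <- function(n = 1000, lambda0 = 0.01, beta_sex = -0.04) {
  sex   <- rbinom(n, 1, 0.5)
  scale <- lambda0 * exp(beta_sex * sex)
  time  <- sqrt(-2 * log(1 - runif(n)) / scale)    # inverse-transform with the new F
  cens  <- runif(n, 0, quantile(time, 0.9))
  data.frame(time = pmin(time, cens), status = as.numeric(time <= cens), sex = sex)
}

set.seed(3)
res2 <- replicate(500, {
  d <- simulate_increasing()
  c(cox     = coef(coxph(Surv(time, status) ~ sex, data = d)),
    poisson = coef(glm(status ~ sex + offset(log(time)),
                       family = poisson, data = d))["sex"])
})
rowMeans(res2)  # average estimate of the sex effect for each model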
In this simulation, the averages of the Poisson-model estimates are clearly further from the true values than those of the Cox model. This is not surprising, as we have violated the assumption of constant hazards.
When the hazard is constant, the survivor function, $S$, is of the form
$S(t) = \exp(-\alpha t)$,
for some positive $\alpha$ depending on the specific subject; thus $S$ is convex. If we use the Kaplan-Meier estimator to get an estimate of $S$ for the original data, we see the following.
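A sketch of how that estimate can be obtained, assuming the ${\tt colon}$ data that ship with the survival package and pooling all rows as in the analysis above:
library(survival)
km <- survfit(Surv(time, status) ~ 1, data = colon)
plot(km, xlab = "Days", ylab = "Estimated survivor function S(t)")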
This function looks concave. This doesn't prove anything, but it could be a hint that the assumption of constant hazards is not fulfilled for this data set, which in turn could explain the discrepancies between the two models.
A final remark on the data: as far as I know, ${\tt colon}$ holds data on both time to recurrence of the cancer and time to death (there are two observations for each value of ${\tt id}$). In the above, we've been modeling them as if they were the same thing. That's probably not a good idea.