Without seeing the dataset (not available) it seems mostly correct. The nice thing about Poisson regressions is that they can provide rates when used as suggested. One thing that may be worth to keep in mind is that there may be overdispersion where you should switch to a negative binomial regression (see the MASS package).
The Poisson regression doesn't care whether the data as aggregated or not, but in practice non-aggregated data is frail and can cause some unexpected errors. Note that you cannot have surv == 0
for any of the cases. When I've tested the estimates are the same:
set.seed(1)
n <- 1500
data <-
data.frame(
dead = sample(0:1, n, replace = TRUE, prob = c(.9, .1)),
surv = ceiling(exp(runif(100))*365),
gender = sample(c("Male", "Female"), n, replace = TRUE),
diagnosis = sample(0:1, n, replace = TRUE),
age = sample(60:80, n, replace = TRUE),
inclusion_year = sample(1998:2011, n, replace = TRUE)
)
library(dplyr)
model <-
data %>%
group_by(gender,
diagnosis,
age,
inclusion_year) %>%
summarise(Deaths = sum(dead),
Person_time = sum(surv)) %>%
glm(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) + offset(log(Person_time/10^3/365.25)),
data = . , family = poisson)
alt_model <- glm(dead ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) + offset(log(surv/10^3/365.25)),
data = data , family = poisson)
sum(coef(alt_model) - coef(model))
# > 1.779132e-14
sum(abs(confint(alt_model) - confint(model)))
# > 6.013114e-11
As you get a rate it is important to center the variables so that the intercept is interpretable, e.g.:
> exp(coef(model)["(Intercept)"])
(Intercept)
51.3771
Can be interpreted as the base rate and then the covariates are rate ratios. If we want the base rate after 10 years:
> exp(coef(model)["(Intercept)"] + coef(model)["I(inclusion_year - 1998)"]*10)
(Intercept)
47.427
I've currently modeled the inclusion year as a trend variable but you should probably check for nonlinearities and sometimes it is useful to do a categorization of the time points. I used this approach in this article:
D. Gordon, P. Gillgren, S. Eloranta, H. Olsson, M. Gordon, J. Hansson, and K. E. Smedby, “Time trends in incidence of cutaneous melanoma by detailed anatomical location and patterns of ultraviolet radiation exposure: a retrospective population-based study,” Melanoma Res., vol. 25, no. 4, pp. 348–356, Aug. 2015.
You can compare Cox regression models (coxph
) in R
with plrtest
which is partial likelihood ratio test for non-nested coxph
models:
require("survival")
require("nonnestcox") #github.com/thomashielscher/nonnestcox
pbc <- subset(pbc, !is.na(trt))
mod1 <- coxph(Surv(time, status==2) ~ age, data=pbc, x=T)
mod2 <- coxph(Surv(time, status==2) ~ age + albumin + bili + edema + protime, data=pbc, x=T)
mod3 <- coxph(Surv(time, status==2) ~ age + log(albumin) + log(bili) + edema +
log(protime), data=pbc, x=T)
plrtest(mod3, mod2, nested=F) # non-nested models
plrtest(mod3, mod1, nested=T) # nested models
Best Answer
One possible approach would be to predict the median survival time (i.e. the expected time at which the survival for a patient with a specific combination of characteristics would be 0.5, in other words, from that timepoint onwards it is more likely that the patient has died than that he is still alive). This can be done using functions from the
rms
package as follows:A few notes:
I see that you delete patients who died 'etype=2' from the dataset. This could lead to bias. In this case you want to evaluate recurrence, death is actually a competing risk for recurrence which calls for different type of survival analysis. Then it would be advisable to call in expert help.
The predicted survival time depends on the patients in the dataset you used to develop the model. If e.g. the model was build using measurements of patients at the day of their diagnosis (day 0), you should also enter the predictor values of the patients at day 0 and not those at day 2000. Since the example patient you mention has already survived up to day 2000, the model would not be valid for him, as his survival is probably longer. Or you could for example include a predictor variable indicating the time since diagnosis.
As I don't know the details of your research project, I would advise you to consult a statistician to determine a valid approach for your research project.
Hope this helps.