Without seeing the dataset (it isn't available), your approach seems mostly correct. The nice thing about Poisson regression is that it can provide rates when used as suggested. One thing worth keeping in mind is that there may be overdispersion, in which case you should switch to a negative binomial regression (see the MASS package).
The Poisson regression doesn't care whether the data are aggregated or not, but in practice non-aggregated data are fragile and can cause some unexpected errors. Note that you cannot have surv == 0
for any of the cases. When I've tested it, the estimates are the same:
# Simulate individual-level survival data
set.seed(1)
n <- 1500
data <-
  data.frame(
    dead = sample(0:1, n, replace = TRUE, prob = c(.9, .1)),
    surv = ceiling(exp(runif(100)) * 365),  # follow-up in days (100 draws, recycled over the n rows); must be > 0
    gender = sample(c("Male", "Female"), n, replace = TRUE),
    diagnosis = sample(0:1, n, replace = TRUE),
    age = sample(60:80, n, replace = TRUE),
    inclusion_year = sample(1998:2011, n, replace = TRUE)
  )
library(dplyr)
# Aggregated version: collapse to deaths and person-time per covariate
# pattern, then fit a Poisson model with the log person-time as offset
# (rescaled to units of 1000 person-years)
model <-
  data %>%
  group_by(gender, diagnosis, age, inclusion_year) %>%
  summarise(Deaths = sum(dead),
            Person_time = sum(surv)) %>%
  glm(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) +
        offset(log(Person_time/10^3/365.25)),
      data = ., family = poisson)
# Non-aggregated version: one row per individual, same linear predictor
alt_model <- glm(dead ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) +
                   offset(log(surv/10^3/365.25)),
                 data = data, family = poisson)
# Coefficients and confidence intervals agree to numerical precision
sum(coef(alt_model) - coef(model))
# > 1.779132e-14
sum(abs(confint(alt_model) - confint(model)))
# > 6.013114e-11
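As a side note on the overdispersion point above, here is a minimal sketch of a check and the switch to a negative binomial; agg is a hypothetical name for the aggregated data frame (above it was piped straight into glm without being saved):
library(MASS)

# Rough check: a residual deviance much larger than its degrees of
# freedom suggests overdispersion
deviance(model) / df.residual(model)

# Same linear predictor and offset, negative binomial family
nb_model <- glm.nb(Deaths ~ gender + diagnosis + I(age - 70) +
                     I(inclusion_year - 1998) +
                     offset(log(Person_time/10^3/365.25)),
                   data = agg)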
Since the model gives you a rate, it is important to center the variables so that the intercept is interpretable, e.g.:
> exp(coef(model)["(Intercept)"])
(Intercept)
51.3771
This can be interpreted as the base rate (per 1000 person-years, given the offset above), and the exponentiated coefficients are then rate ratios. If we want the base rate after 10 years:
> exp(coef(model)["(Intercept)"] + coef(model)["I(inclusion_year - 1998)"]*10)
(Intercept)
47.427
Here I've modeled the inclusion year as a linear trend, but you should check for non-linearities; it is sometimes useful to categorize the time points. I used this approach in this article:
D. Gordon, P. Gillgren, S. Eloranta, H. Olsson, M. Gordon, J. Hansson, and K. E. Smedby, “Time trends in incidence of cutaneous melanoma by detailed anatomical location and patterns of ultraviolet radiation exposure: a retrospective population-based study,” Melanoma Res., vol. 25, no. 4, pp. 348–356, Aug. 2015.
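For illustration, here is a minimal sketch (on the simulated data above) of what categorizing the time points could look like; the period breaks are arbitrary choices of mine:
# Same aggregation as before, with inclusion_year additionally grouped
# into periods so that the two models can be compared on the same data
cat_model <-
  data %>%
  group_by(gender, diagnosis, age, inclusion_year) %>%
  summarise(Deaths = sum(dead),
            Person_time = sum(surv)) %>%
  mutate(period = cut(inclusion_year,
                      breaks = c(1997, 2002, 2007, 2011),  # arbitrary blocks
                      labels = c("1998-2002", "2003-2007", "2008-2011"))) %>%
  glm(Deaths ~ gender + diagnosis + I(age - 70) + period +
        offset(log(Person_time/10^3/365.25)),
      data = ., family = poisson)

# Compare against the linear-trend model, e.g. AIC(model, cat_model);
# a spline, e.g. splines::ns(inclusion_year, 3), is another way to
# check for non-linearity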
You can compare Cox regression models (coxph) in R with plrtest, which implements a partial likelihood ratio test for non-nested coxph models:
require("survival")
require("nonnestcox") #github.com/thomashielscher/nonnestcox
pbc <- subset(pbc, !is.na(trt))
mod1 <- coxph(Surv(time, status==2) ~ age, data=pbc, x=T)
mod2 <- coxph(Surv(time, status==2) ~ age + albumin + bili + edema + protime, data=pbc, x=T)
mod3 <- coxph(Surv(time, status==2) ~ age + log(albumin) + log(bili) + edema +
log(protime), data=pbc, x=T)
plrtest(mod3, mod2, nested=F) # non-nested models
plrtest(mod3, mod1, nested=T) # nested models
Best Answer
The logistic regression model assumes the response is a Bernoulli trial (or, more generally, a binomial, but for simplicity we'll keep it 0/1). A survival model assumes the response is typically a time to event (again, there are generalizations of this that we'll skip). Another way to put this is that units pass through a series of values until an event occurs. It isn't that a coin is actually, discretely flipped at each point. (That could happen, of course, but then you would need a model for repeated measures, perhaps a GLMM.)
Your logistic regression model treats each death as a coin flip that occurred at that age and came up tails. Likewise, it treats each censored datum as a single coin flip that occurred at the specified age and came up heads. The problem is that this is inconsistent with what the data really are.
Here are some plots of the data, and the output of the models. (Note that I flip the predictions from the logistic regression model to predicting being alive so that the line matches the conditional density plot.)
It may be helpful to consider a situation in which the data were appropriate for either a survival analysis or a logistic regression. Imagine a study to determine the probability that a patient will be readmitted to the hospital within 30 days of discharge under a new protocol or standard of care. However, all patients are followed to readmission, and there is no censoring (this isn't terribly realistic), so the exact time to readmission could be analyzed with survival analysis (viz., a Cox proportional hazards model here). To simulate this situation, I'll use exponential distributions with rates .5 and 1, and use the value 1 as a cutoff to represent 30 days:
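(The original simulation code isn't shown here; below is a minimal sketch of the setup as described. The seed, the group size of 50 per arm, and which arm gets which rate are my assumptions, so it won't reproduce the exact p-values quoted next.)
library(survival)
set.seed(7)                          # hypothetical seed
n     <- 50                          # assumed patients per arm
group <- rep(0:1, each = n)          # 0 = standard of care, 1 = new protocol
time  <- c(rexp(n, rate = .5), rexp(n, rate = 1))
readmit30 <- as.numeric(time <= 1)   # 1 represents "readmitted within 30 days"

# Cox model on the exact times (all events observed, no censoring)
summary(coxph(Surv(time, rep(1, 2*n)) ~ group))
# logistic regression on the dichotomized outcome
summary(glm(readmit30 ~ group, family = binomial))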
In this case, we see that the p-value from the logistic regression model (0.163) was higher than the p-value from the survival analysis (0.005). To explore this idea further, we can extend the simulation to estimate the power of a logistic regression analysis vs. a survival analysis, and the probability that the p-value from the Cox model will be lower than the p-value from the logistic regression. I'll also use 1.4 as the threshold, so that I don't disadvantage the logistic regression with a suboptimal cutoff (a sketch of such a power simulation follows after the results).
So: the power of the logistic regression is lower (about 75%) than that of the survival analysis (about 93%), and 90% of the p-values from the survival analysis were lower than the corresponding p-values from the logistic regression. Taking the lag times into account, instead of just noting whether they fall below or above some threshold, does yield more statistical power, as you had intuited.
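(Again, the original code isn't shown; here is a minimal sketch of what such a power simulation could look like. The number of replicates and the patients per arm are assumptions, so it will only roughly approximate the percentages above.)
library(survival)
set.seed(1)
B <- 1000                              # number of simulated studies (assumed)
n <- 50                                # patients per arm (assumed)
p_cox <- p_lr <- numeric(B)
for (i in seq_len(B)) {
  group   <- rep(0:1, each = n)
  time    <- c(rexp(n, rate = .5), rexp(n, rate = 1))
  event   <- rep(1, 2*n)               # everyone is readmitted eventually
  readmit <- as.numeric(time <= 1.4)   # the 1.4 threshold
  p_cox[i] <- summary(coxph(Surv(time, event) ~ group))$coefficients[, "Pr(>|z|)"]
  p_lr[i]  <- summary(glm(readmit ~ group, family = binomial))$coefficients["group", "Pr(>|z|)"]
}
mean(p_cox < .05)    # power of the survival analysis
mean(p_lr  < .05)    # power of the logistic regression
mean(p_cox < p_lr)   # proportion of runs where the Cox p-value is smaller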