In R, I am analysing time-to-event data to explore the effect of a biomarker on the risk of an event. My data look like this toy dataset:
> head(data)
pt sex baseline_age event event_days death stop_days
1 1 M 17082 0 3991 0 3991
2 2 M 25185 0 3491 1 3491
3 3 F 14856 0 3988 0 3988
4 4 F 22046 0 4004 0 4004
5 5 M 23543 1 3924 0 4012
Description of columns:
- pt: individual's ID
- sex: gender (M=male, F=female)
- baseline_age: the age of the individual (in days) at the start of the study
- event: whether the event was observed (1=yes, 0=no)
- event_days: time (in days) between baseline and the event, or between baseline and the last observation when no event was observed
- death: 0=no, 1=yes
- stop_days: time (in days) between baseline and death or last follow-up
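To fit a Cox model comparing the sexes, I used:
library(survival)
library(magrittr)  # provides the %>% pipe used below
# Cox proportional hazards model with sex as the only covariate
coxph(Surv(event_days, event) ~ sex, data = data) %>%
  gtsummary::tbl_regression(exponentiate = TRUE)  # report hazard ratios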
The result can be visualized as follows:
My question: at the end (~6000 days after the start of the study, see graph), the survival probability is close to 0, but only ~7% of the individuals had an event. I think my model is wrong, and I suspect that this has to do with the event_days and stop_days columns being the same in my dataset when we have no news for an individual. How can I solve this problem?
Best Answer
What you see as a problem is actually the problem solved by this type of survival analysis. Those who drop out of a study before an event do provide information up to their last observation time. Either just omitting them from analysis or treating their last observation time as an event would bias the results.
Thus the risk of an event at any time is estimated by the ratio of the number having an event at that time to the number still at risk at that time. This process starts at the earliest times, pruning the at-risk set as individuals die or drop out, and taking prior survival probabilities into account. It moves on in that way until the last observed event. Unless there is some information included in the times that individuals drop out without an event, this gives a reliable overall estimate of survival since time = 0.
Even though only 7% of your cases were observed to have an event, all of the cases contributed to the analysis up through their last observation times. And they all will have the event of death; you just know it's at some time after you last observed them.
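To make the risk-set arithmetic described in this answer concrete, here is a minimal sketch in R. The data frame `toy` is made up for illustration and is not your dataset: it has 100 subjects, of whom 93 are censored early and only 7 have an event, all late in follow-up. The Kaplan-Meier product is computed by hand and then compared with `survival::survfit()`.

```r
library(survival)

# Hypothetical data (an assumption for illustration, not the asker's data):
# 93 subjects censored between day 50 and day 4650, 7 events from day 5000 on.
toy <- data.frame(
  time  = c(seq(50, 4650, by = 50), seq(5000, 5600, by = 100)),
  event = c(rep(0, 93), rep(1, 7))
)

# Distinct times at which an event was observed
event_times <- sort(unique(toy$time[toy$event == 1]))

# Number still at risk just before each event time (7, 6, ..., 1 here,
# because every censored subject has already left the risk set)
n_risk <- sapply(event_times, function(t) sum(toy$time >= t))

# Number of events at each event time
n_event <- sapply(event_times, function(t) sum(toy$time == t & toy$event == 1))

# Kaplan-Meier estimate: running product of (1 - events / at risk)
km_manual <- cumprod(1 - n_event / n_risk)

# Same estimate from the survival package; summary() reports event times only
fit <- survfit(Surv(time, event) ~ 1, data = toy)
km_survfit <- summary(fit)$surv

print(data.frame(event_times, n_risk, n_event, km_manual, km_survfit))
```

Only 7% of these subjects have an event, yet the estimated survival falls to 0 at the last event time, because by then a single subject remains at risk. On your own data, the `n.risk` column of `summary(survfit(Surv(event_days, event) ~ sex, data = data))` would let you check how few individuals are still being followed near day 6000.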