Cox Proportional Hazards – Handling Participants Dropping Out Without Event in Cox Models

cox-model, r, survival, time series

In R, I analyse time-to-event data to explore the effect of a biomarker on an event risk. To do this, I work with data that looks like this toy dataset:

> head(data)
   pt sex baseline_age  event  event_days death  stop_days
1   1   M         17082      0       3991     0       3991
2   2   M         25185      0       3491     1       3491
3   3   F         14856      0       3988     0       3988
4   4   F         22046      0       4004     0       4004
5   5   M         23543      1       3924     0       4012

Description of columns:

  • pt: individual's ID
  • sex: gender (M=male, F=female)
  • baseline_age: the age of the individual (in days) at the start of the study
  • event: whether the event was observed (1 = yes, 0 = no)
  • event_days: time (in days) between baseline and the event, or last news if no event was observed
  • death: 0=no, 1=yes
  • stop_days: time (in days) between baseline and death or last news

To fit the Cox model with sex as the covariate, I used:

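library(survival)
library(magrittr)  # provides the %>% pipe

# Cox model for time to event, with sex as the only covariate
coxph(Surv(event_days, event) ~ sex, data = data) %>%
  gtsummary::tbl_regression(exponentiate = TRUE)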

The result can be visualized as follows:
[Image: cox_plot]

My question: at the end (~6000 days after the start of the study, see graph), the survival probability is close to 0, but only ~7% of the individuals had an event. I think my model is wrong, and I suspect that this has to do with the event_days and stop_days columns being the same in my dataset when we have no news for an individual. How can I solve this problem?

Best Answer

What you see as a problem is actually the problem solved by this type of survival analysis. Those who drop out of a study before an event do provide information up to their last observation time. Either just omitting them from analysis or treating their last observation time as an event would bias the results.

Thus the risk of an event at any time is estimated by the ratio of the number having an event at that time to the number still at risk at that time. This process starts at the earliest times, pruning the at-risk set as individuals die or drop out, and taking prior survival probabilities into account. It moves on in that way until the last observed event. Unless the times at which individuals drop out without an event carry information about their event risk (informative censoring), this gives a reliable overall estimate of survival since time = 0.
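To make the risk-set calculation concrete, here is a minimal sketch on a made-up dataset (the `toy` data frame and its values are hypothetical, not taken from the question). It computes the product-limit (Kaplan-Meier) estimate by hand and compares it with `survfit()` from the survival package.

```r
library(survival)

# Hypothetical toy data: follow-up time in days and event indicator
# (1 = event observed, 0 = dropped out / censored)
toy <- data.frame(
  time  = c(5, 8, 8, 12, 15, 20, 20, 25, 30, 30),
  event = c(1, 0, 1, 0,  1,  0,  0,  1,  0,  0)
)

# Distinct times at which at least one event was observed
event_times <- sort(unique(toy$time[toy$event == 1]))

# At each event time: how many are still at risk, and how many have the event
n_risk  <- sapply(event_times, function(t) sum(toy$time >= t))
n_event <- sapply(event_times, function(t) sum(toy$time == t & toy$event == 1))

# Survival = running product of the conditional survival probabilities
km_manual <- cumprod(1 - n_event / n_risk)

# The same quantity from survfit(), evaluated at the event times
fit <- survfit(Surv(time, event) ~ 1, data = toy)
km_survfit <- summary(fit, times = event_times)$surv

data.frame(event_times, n_risk, n_event, km_manual, km_survfit)
```

Censored individuals never trigger a drop in the curve, but they do count in `n_risk` until their last observation, which is how they contribute information.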

Even though only 7% of your cases were observed to have an event, all of the cases contributed to the analysis up through their last observation times. And they all will have the event of death; you just know it's at some time after you last observed them.
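As a rough illustration of how a low crude event rate and a survival curve that ends near 0 can coexist, here is a sketch with a constructed cohort (the `cohort` data and all its times are hypothetical, chosen only to mimic the pattern described in the question). Most individuals are censored before any event occurs, so the few late events happen in a very small risk set.

```r
library(survival)

# Hypothetical cohort of 100 (not the asker's data):
# 90 people are censored between day 3000 and day 4000,
# leaving only 10 at risk when the events start to occur.
early_censored <- data.frame(
  time  = seq(3000, 4000, length.out = 90),
  event = 0
)
late_followup <- data.frame(
  time  = c(4500, 4600, 4800, 5000, 5100, 5200, 5500, 5800, 5900, 6000),
  event = c(1,    0,    1,    1,    0,    1,    1,    1,    0,    1)
)
cohort <- rbind(early_censored, late_followup)

# Crude proportion of individuals with an observed event
crude_rate <- mean(cohort$event)

# Kaplan-Meier fit: n.risk shrinks quickly after day 4000,
# so each late event removes a large share of the remaining survival
fit <- survfit(Surv(time, event) ~ 1, data = cohort)
summary(fit)

# Estimated survival at the last event time
final_survival <- min(fit$surv)
c(crude_rate = crude_rate, final_survival = final_survival)
```

In this construction only 7 of 100 individuals have an event, yet the estimated survival reaches 0 at day 6000 because the last person at risk has the event. The tail of such a curve rests on very few individuals, so its confidence interval is wide.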
