Variables Needed for Cox Regression (Survival Analysis)

cox-model survival

I'm preparing a longitudinal dataset (with up to 5 observations per participant) for Cox regression in R. I have data for the follow-up period and date (FUPeriod and FU, respectively), the date of hospital discharge (HospDis) and the date of death (Death; if applicable).

ID  FUPeriod  HospDis     FU          Death
1   0         2017-09-26  NA          NA
1   1         2017-09-26  2017-11-16  NA
1   2         2017-09-26  2019-02-12  NA
1   5         2017-09-26  2021-09-10  NA
1   10        2017-09-26  NA          2022-02-20

I'm a little stuck on the variables that I need to create from the available temporal data to start my analyses. I know that at the very least I need a censoring/event indicator variable and a survival time variable. My question is whether the survival time and censoring variables need to have values at all time points (i.e., FUPeriod = 0, 1, 2, 5, 10), or only at the last available time point (i.e., FUPeriod = 10). Are these the variables and values I should ultimately have, where SurvTime is survival time in months since hospital discharge (HospDis), and Event = 1 if Death is a valid date and 0 otherwise?

ID  FUPeriod  HospDis     FU          Death       SurvTime  Event
1   0         2017-09-26  NA          NA          0         0
1   1         2017-09-26  2017-11-16  NA          1.675565  0
1   2         2017-09-26  2019-02-12  NA          16.55852  0
1   5         2017-09-26  2021-09-10  NA          47.47433  0
1   10        2017-09-26  NA          2022-02-20  52.82957  1

Best Answer

The answer depends on whether you are modeling covariates whose values change over time, and whether you are building a continuous-time or a discrete-time survival model.

If your model only uses covariate values in place at study entry (SurvTime = 0), then the comment by @user352188 is correct. In that case the data just need a single row per individual, containing the covariate values at SurvTime = 0, the time of last follow-up or death, and an indicator of whether that time was an event time (as opposed to right-censoring).
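As a rough sketch of that single-row layout, the R code below collapses long data like the table in the question to one row per participant. Participant 2 is made up for illustration, and the divisor 30.4375 (365.25 / 12 days per month) is an assumption inferred from the SurvTime values in the question, not something the question states.

```r
library(survival)  # recommended package, included with standard R installs

# Participant 1 is taken from the question; participant 2 is hypothetical
# and is censored at the last follow-up date.
long <- data.frame(
  ID      = c(1, 1, 1, 1, 1, 2, 2, 2),
  HospDis = as.Date(c(rep("2017-09-26", 5), rep("2018-01-10", 3))),
  FU      = as.Date(c(NA, "2017-11-16", "2019-02-12", "2021-09-10", NA,
                      NA, "2018-03-01", "2019-01-15")),
  Death   = as.Date(c(NA, NA, NA, NA, "2022-02-20", NA, NA, NA))
)

# Collapse to one row per participant: time to death if it occurred,
# otherwise time to the last recorded follow-up (right-censored).
one_row <- do.call(rbind, lapply(split(long, long$ID), function(d) {
  died <- any(!is.na(d$Death))
  end  <- if (died) max(d$Death, na.rm = TRUE) else max(d$FU, na.rm = TRUE)
  days <- as.numeric(difftime(end, d$HospDis[1], units = "days"))
  data.frame(
    ID       = d$ID[1],
    SurvTime = days / 30.4375,   # days -> months (assumed conversion)
    Event    = as.integer(died)  # 1 = death, 0 = censored
  )
}))

# Response object that a Cox model with baseline-only covariates would use
y <- with(one_row, Surv(SurvTime, Event))
```

With baseline covariates merged into `one_row`, the model call would then look something like `coxph(Surv(SurvTime, Event) ~ age, data = one_row)`, where `age` is a placeholder covariate name.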

If your model includes covariates whose values change over time, you need to have multiple data rows per individual. Each data row then includes for that individual a start time, a stop time, the covariate values in place during that time interval, and an indicator of whether or not the stop time is for an event. That's called the "counting process" data format.
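A minimal sketch of the counting-process layout for the participant in the question follows. The covariate `score` and its values are invented for illustration; the interval boundaries reuse the SurvTime values computed in the question.

```r
library(survival)  # recommended package, included with standard R installs

# Hypothetical time-varying covariate 'score', measured at discharge and at
# each follow-up visit; 'time' is months since hospital discharge.
visits <- data.frame(
  ID    = 1,
  time  = c(0, 1.675565, 16.55852, 47.47433),
  score = c(3, 4, 4, 6)  # made-up values
)
end_time  <- 52.82957  # death (or last contact), months since discharge
end_event <- 1         # 1 = death, 0 = censored

# Each row covers (tstart, tstop] with the covariate value in place then.
cp <- data.frame(
  ID     = visits$ID,
  tstart = visits$time,
  tstop  = c(visits$time[-1], end_time),
  score  = visits$score
)
# Only the final interval can end in the event.
cp$event <- c(rep(0, nrow(cp) - 1), end_event)

# Three-argument Surv() gives the counting-process response
y <- with(cp, Surv(tstart, tstop, event))
```

With all participants stacked into `cp`, the model call would be along the lines of `coxph(Surv(tstart, tstop, event) ~ score, data = cp)`. The survival package also provides `tmerge()` for building such data sets, which is likely more convenient than manual construction for real data.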

The above assumes that you have a continuous-time model of survival. The vignettes of the R survival package go into more detail, with examples.

Sometimes data are available only at a small number of fixed times, say every year for 5 years (panel data). Such data are analyzed with discrete-time survival methods, which amount to a set of binomial regressions, one at each fixed time. In that case it's simplest to have a separate data row for each time point at which the individual is still at risk.
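The person-period layout for the discrete-time case might be built as in the sketch below. All data here are hypothetical: six people followed over up to 5 yearly waves, with a made-up baseline covariate `x`.

```r
# Hypothetical panel: 'last' is the final yearly wave (1 to 5) at which each
# person was observed; 'event' says whether the event occurred at that wave.
subjects <- data.frame(
  ID    = 1:6,
  last  = c(2, 5, 3, 1, 4, 5),
  event = c(1, 0, 1, 0, 1, 0),
  x     = c(0.5, -1.2, 1.1, 0.3, -0.4, 0.8)  # made-up baseline covariate
)

# One row per person per wave at which that person is still at risk.
pp <- do.call(rbind, lapply(seq_len(nrow(subjects)), function(i) {
  s     <- subjects[i, ]
  waves <- seq_len(s$last)
  data.frame(
    ID   = s$ID,
    wave = waves,
    x    = s$x,
    # outcome is 1 only in the wave where the event happened
    y    = as.integer(waves == s$last & s$event == 1)
  )
}))
```

A discrete-time hazard model could then be fit as a binomial regression with a separate term per wave, for example `glm(y ~ factor(wave) + x, family = binomial, data = pp)`. The toy data above are far too small for that fit to be meaningful, so the call is shown only for its form.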