Survival Curves – How to Use Surv and Survfit

Tags: censoring, interval-censoring, kaplan-meier, survival

I have fit a simple KM curve using the Surv and survfit functions in R. The first 6 rows of the data are shown below alongside the code used to obtain the KM curves.

  starttime    futime failure 
1   0.00000  7.720739       1
2   0.00000  9.396304       1
3   0.00000 65.149897       0
4   0.00000 70.209446       0
5   0.00000 10.710472       1
6   0.00000 67.055441       0

library(survival)
sfit1 <- survfit(Surv(starttime, futime, failure) ~ 1, data = tdata1)
# Because every starttime is 0 here, this is equivalent to:
sfit1 <- survfit(Surv(futime, failure) ~ 1, data = tdata1)

The KM curves are shown in black in the attached figure.

I then adapted the dataset so that the starting times of the right censored units were nonzero. The first 6 rows of the new dataset are shown below. The fitted KM curves are shown in blue in the same plot.

  starttime    futime failure
1   0.00000  7.720739       1
2   0.00000  9.396304       1
3  32.57495 65.149897       0
4  35.10472 70.209446       0
5   0.00000 10.710472       1
6  33.52772 67.055441       0

sfit2 <- survfit(Surv(starttime, futime, failure) ~ 1, data = tdata2)

I do not understand why the curves are different. I understand that if I changed the starting times of the units that failed, for example

    starttime futime failure
1          3       7       3

this would mean that the first unit failed between weeks 3 and 7 (interval censored). If this were the case I would expect the new data to produce a different curve to the original curve (black). However, I changed the starting times of the units that did not experience any event.

In other words

           starttime futime failure
    3           32.6   65.2       0

is saying that unit 3 was working at 65.2 weeks, when the study ended. Therefore, this unit is right censored. The additional information that this unit was working at week 32.6 is useless (perhaps I am wrong).

I would understand the curve changing if some units had failed prior to the starting time, i.e. left censored data. However, I do not have any left censored data. I simply have units that fail at a known time, and right censored units (i.e. Table 1).

What additional information is Table 2 (the second dataset) giving compared to Table 1 (the first dataset)? I understand that unit 3 (in Table 2) entered the study at the age of 32.6 weeks, but so what? This unit was new (age 0) at some point; it just so happens that we started recording this unit's age at 32.6 weeks.

Mathematically, using the formula to calculate the KM estimates, unit 3, which entered the study at 32.6 weeks, would not be included in the risk set when calculating survival probabilities for times earlier than 32.6 weeks. Therefore, mathematically, I understand the difference. But, in the context of my question, I do not see why the curves should be different.

Best Answer

The "mathematical reason" is fundamentally the reason why the curves differ. Welcome to the world of left-truncated survival times.

When a case has a start time greater than 0 in the way you formatted the data for sfit2, that case provides no information about survival prior to that start time. That's considered left truncation.
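In symbols, this can be written as follows. The notation is my own shorthand rather than anything from the question: $L_i$ is the start time of case $i$, $T_i$ its follow-up time, $t_j$ the distinct event times, and $d_j$ the number of events at $t_j$.

```latex
\hat{S}(t) = \prod_{j:\, t_j \le t} \left( 1 - \frac{d_j}{n_j} \right),
\qquad
n_j = \#\{\, i : L_i < t_j \le T_i \,\}
```

With every $L_i = 0$ this reduces to the usual K-M risk set $n_j = \#\{\, i : T_i \ge t_j \,\}$. Setting $L_i > 0$ for a case removes it from every $n_j$ with $t_j \le L_i$.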

As you say, those left-truncated cases don't enter the risk set prior to that time. Each drop in the Kaplan-Meier (K-M) curve is determined by the ratio of the number of events at that time to the number of cases at risk. When you diminish the number of cases at risk at early times while keeping the same number of early events, the K-M curve necessarily drops faster at the start. With the product limit form of the K-M estimator, once the curve has dropped you have a lower baseline for the next drop.
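To make that arithmetic concrete, here is a minimal sketch in base R that computes the product-limit estimate by hand, counting a case as at risk at an event time only if it has already entered. The helper `km_by_hand` and the six-row toy data are hypothetical and made up for illustration; they are not your data, and this is not how `survfit` is implemented internally.

```r
# Product-limit estimate by hand, allowing delayed entry (left truncation).
# A case is in the risk set at event time t only if start_time < t <= end_time.
km_by_hand <- function(start_time, end_time, status) {
  event_times <- sort(unique(end_time[status == 1]))
  n_risk  <- integer(length(event_times))
  n_event <- integer(length(event_times))
  for (j in seq_along(event_times)) {
    t <- event_times[j]
    n_risk[j]  <- sum(start_time < t & end_time >= t)
    n_event[j] <- sum(end_time == t & status == 1)
  }
  data.frame(time = event_times, n_risk = n_risk, n_event = n_event,
             surv = cumprod(1 - n_event / n_risk))
}

# Hypothetical toy data: three early failures, three units censored at time 10.
end_time <- c(2, 4, 6, 10, 10, 10)
status   <- c(1, 1, 1, 0, 0, 0)

# Version 1: everyone observed from time 0.
start_all_zero <- rep(0, 6)
# Version 2: the censored units only enter observation at time 5.
start_delayed <- ifelse(status == 0, 5, 0)

km_zero    <- km_by_hand(start_all_zero, end_time, status)
km_delayed <- km_by_hand(start_delayed, end_time, status)

print(km_zero)
print(km_delayed)
```

In the first version the risk sets at times 2, 4 and 6 have 6, 5 and 4 cases, giving survival estimates of 0.833, 0.667 and 0.500. In the second version the risk sets have 3, 2 and 4 cases, giving 0.667, 0.333 and 0.250: the same three events, but a steeper early drop that the late entrants cannot undo.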

Furthermore, the left-truncated cases you show do not seem to enter the risk set until after a time of 30, by which point the original K-M curve is already relatively flat and few events remain. They are therefore in relatively few risk sets at event times, and apparently only after most of the events have occurred, so they provide very little information and have little influence on the subsequent shape of the curve.

The event = 3 (failure = 3) specified in one of your examples evidently represents interval censoring, which Surv interprets that way when called with type = "interval". You can also have left truncation with a defined event time if you specify event = 1 for the end time. That's the data format used for time-dependent covariates and for other applications of the counting-process data format in survival analysis, like repeated events.
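As a small sketch of that distinction, the same pair of times (3, 7) can be turned into two different kinds of `Surv` object. The values are the ones from your one-row example; the object names are just for illustration.

```r
library(survival)

# Interval censoring: the event happened somewhere between 3 and 7.
# Status code 3 means "interval censored" when type = "interval".
s_interval <- Surv(time = 3, time2 = 7, event = 3, type = "interval")

# Counting-process format: observation began at 3 (left truncation)
# and an event was observed at exactly 7.
s_counting <- Surv(time = 3, time2 = 7, event = 1)

print(s_interval)
print(s_counting)

attr(s_interval, "type")  # "interval"
attr(s_counting, "type")  # "counting"
```

When three arguments are supplied without `type`, `Surv` defaults to the counting-process interpretation, which is why the second dataset was treated as left-truncated rather than as anything to do with interval censoring.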
