I'll give an explanation that is very close to that of Maarten Buis but just a little more elaborate. As always in survival analysis, different time scales can be applied. I think that age is maybe the more intuitive time scale in your setting, so that's where I'll start my answer. Afterwards, I'll try to use that intuition to answer the question.
Let $C_i$ be the time of birth of subject $i$. From your data we can easily calculate the age at entering the study,
$$
A_i = t_0 - C_i
$$
and age of exiting the study,
$$
B_i = \min\{T - C_i, D_i\},
$$
where $D_i$ is age at death. Note that the $i$'th subject is under observation on the age interval $(A_i, B_i]$. On this time scale, the study subjects do not enter the study at the same time. Let's denote the minimum age at entering the study by
$$
\alpha = \min_i A_i.
$$
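The bookkeeping above can be sketched in a few lines of plain Python. This is only an illustration: the function name `age_intervals` and the numbers in the example are hypothetical, not taken from your data, and ages at death are passed in directly (`None` when no death was observed).

```python
import math

def age_intervals(births, death_ages, t0, T):
    """Return one (A_i, B_i, died_i) triple per subject on the age scale.

    births:     calendar time of birth C_i
    death_ages: age at death D_i, or None if death was not observed
    t0, T:      calendar start and end of the study
    """
    rows = []
    for c, d in zip(births, death_ages):
        entry_age = t0 - c                       # A_i = t0 - C_i
        end_age = T - c                          # age at the end of the study
        death_age = math.inf if d is None else d
        exit_age = min(end_age, death_age)       # B_i = min{T - C_i, D_i}
        rows.append((entry_age, exit_age, death_age <= end_age))
    return rows

# Hypothetical example: study runs from calendar time 10 to 15.
intervals = age_intervals(births=[9.5, 7.0, 4.0],
                          death_ages=[None, 6.0, None],
                          t0=10, T=15)
alpha = min(a for a, _, _ in intervals)          # youngest age at entry
```

Each subject contributes information only on its own interval $(A_i, B_i]$, and nothing at all is observed below `alpha`.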
What survival information do we have before time $\alpha$? None. This is why we can't say anything about the probability of surviving the age interval $(0, \alpha]$. Necessarily, our Kaplan-Meier estimate must be conditional on survival until age $\alpha$. To give an example: let's say that $\alpha$ is $1$ year. Would we be able to calculate the survivor function at time $5$ years, $S(5) = P(D > 5)$? Could we calculate how many children would live to see their fifth birthday? No, because we simply don't know how dangerous the first year is. We can calculate only the conditional survivor function $P(D > 5 \mid D > 1)$. Actually, this can again be explained by a change in time scale: there is nothing special about 0. Your Kaplan-Meier estimate doesn't have to start at time zero; it can start at some other time, which corresponds to, e.g., the time scale defined by age minus $\alpha$. In your data, you write that $\alpha$ is very small, as some children are included very young. Thus, for $s > \alpha$,
$$
P(D > s | D > \alpha) = S(s)/S(\alpha) \simeq S(s)
$$
and actually there is equality in the limit $\alpha \rightarrow 0$ if we assume $S$ to be continuous.
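The first equality in the display above follows directly from the definition of conditional probability, because for $s > \alpha$ the event $\{D > s\}$ is contained in $\{D > \alpha\}$:
$$
P(D > s \mid D > \alpha) = \frac{P(D > s,\ D > \alpha)}{P(D > \alpha)} = \frac{P(D > s)}{P(D > \alpha)} = \frac{S(s)}{S(\alpha)}.
$$
As a purely illustrative example with made-up numbers: if $S(\alpha) = 0.99$ and $S(5) = 0.90$, then $P(D > 5 \mid D > \alpha) = 0.90/0.99 \approx 0.909$, which is close to $S(5)$ precisely because $S(\alpha)$ is close to $1$.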
Let's change back to your original time scale, plain calendar time. You have no idea how dangerous the time before $t_0$ is, therefore your estimate must be conditional on surviving until $t_0$. This stems from the fact that no children are observed before time $t_0$. On this time scale, it doesn't make much of a difference how close the times of birth are to $t_0$, as we have assumed the same hazard for all ages (instead of an age-specific hazard as above). To sum up, on this time scale (using calendar time), the interpretation of the Kaplan-Meier estimate would be that of (for $t \in (t_0, T]$),
$$
P(X > t | X > t_0).
$$
This is not as intuitive as on the age time scale; however, it just means that when doing a study in calendar time, we condition on the subjects having survived the time from birth until the start of our study.
To answer the last part of the question, you do not condition on $T_{first}$ nor on $L_1$, you condition on survival until $t_0$ as this is the minimum of entering times. I think part of the confusion is due to the fact that all the children enter your study at the same time, which is not necessarily the case in all applications, as is evident from using age as the time scale above.
Finally, you could easily say that non-truncation corresponds to the truncation time being smaller than or equal to 0 (or some other natural starting point on a time scale).
The "mathematical reason" is fundamentally the reason why the curves differ. Welcome to the world of left-truncated survival times.
When a case has a start time greater than 0 in the way you formatted the data for `sfit2`, that case provides no information about survival prior to that start time. That's considered left truncation.
As you say, those left-truncated cases don't enter the risk set prior to that time. Each drop in the Kaplan-Meier (K-M) curve is determined by the ratio of the number of events at that time to the number of cases at risk. When you diminish the number of cases at risk at early times while keeping the same number of early events, the K-M curve necessarily drops faster at the start. With the product limit form of the K-M estimator, once the curve has dropped you have a lower baseline for the next drop.
Furthermore, the examples you show of left truncation seem not to enter the risk set until the original K-M curve is relatively flat beyond a time of 30, with relatively few later events. So they provide very little information at all, as they are in relatively few risk sets at event times and apparently only after most of the events have occurred, and thus have little influence on the subsequent shape of the curve.
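The risk-set mechanism described above can be sketched with a hand-rolled product-limit estimator in plain Python. This is not the code behind `sfit2` or any package; the function name `kaplan_meier` and the six-subject data set are hypothetical, chosen only to show that the same events with a smaller early risk set give a steeper curve.

```python
def kaplan_meier(entry, exit_, event):
    """Product-limit estimate allowing delayed entry (left truncation).

    A subject is at risk at event time t when entry < t <= exit.
    Returns a list of (t, S(t)) pairs at the distinct event times.
    """
    event_times = sorted({t for t, e in zip(exit_, event) if e})
    surv, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for a, b in zip(entry, exit_) if a < t <= b)
        deaths = sum(1 for b, e in zip(exit_, event) if e and b == t)
        surv *= 1 - deaths / at_risk      # each drop is deaths / number at risk
        curve.append((t, surv))
    return curve

exit_times = [2, 4, 6, 8, 10, 12]
events = [True, True, True, False, False, False]

# Everyone followed from time 0.
full = kaplan_meier([0] * 6, exit_times, events)

# Same events, but the last three subjects only enter the risk set at time 5.
truncated = kaplan_meier([0, 0, 0, 5, 5, 5], exit_times, events)
```

In the second call the early events are divided by a risk set of 3 and then 2 instead of 6 and then 5, so the curve falls faster at the start, and the late entrants only soften the final drop.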
The `event = 3` specified in one of your examples evidently represents interval censoring, but you can also have left truncation with a defined event time if you specify `event = 1` for the end time. That's the data format used for time-dependent covariates and for other applications of the counting-process data format in survival analysis, like repeated events.
Best Answer
As implied by the tag, this is referred to as interval censoring. In this scenario, for each observation, we have a pair of values that I call an observation interval: $(L_i, R_i]$, where $L_i$ represents a lower bound on the true value of interest and $R_i$ represents an upper bound on the true value of interest. For example, suppose you go to a dentist at age 9 and have no cavities. You go again at age 12 and have one cavity. Then all you know about your "age at first cavity" is that it is in the interval $(9,12]$.
Note: you will need to slightly change your data to put it in this format. For example, you have the interval $(40, 60]$ for subject C but you also state that they are right censored. I assume this means that we know that for subject C, at time 60 the event of interest had not occurred yet. In that case, you should represent their observation interval as $(60, \infty)$.
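A minimal sketch of that recoding in plain Python, assuming each subject is described by the last time the event was known not to have happened and, if available, the first time it was known to have happened. The helper name `to_interval` is hypothetical.

```python
import math

def to_interval(last_negative, first_positive=None):
    """Return the observation interval (L, R] as a tuple.

    last_negative:  latest time the event was known not to have occurred (L)
    first_positive: earliest time the event was known to have occurred (R),
                    or None for a right-censored subject, giving (L, inf)
    """
    right = math.inf if first_positive is None else first_positive
    if not last_negative < right:
        raise ValueError("need L < R for an observation interval")
    return (last_negative, right)

dentist = to_interval(9, 12)     # no cavity at age 9, one cavity at age 12
subject_c = to_interval(60)      # event not yet seen at time 60
```

Under this convention a right-censored subject such as subject C is stored with an infinite upper bound rather than with the interval between its last two visits.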
There exists a variety of tools for analyzing interval-censored data. One of the most basic tools is the non-parametric maximum likelihood estimator (NPMLE). This is basically an extension of the Kaplan-Meier curves that allows for interval censoring. For performing hypothesis testing to compare two groups, the log-rank statistics have been generalized to allow for interval censoring. Finally, survival regression models (proportional hazards, accelerated failure time and proportional odds, to name a few) can be used.
In R, the NPMLE and regression models can be found in my `icenReg` package. The log-rank statistic can be found in the `interval` package.
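To give a feel for what an NPMLE computes, here is a rough plain-Python sketch of a basic self-consistency (EM) iteration of the kind usually attributed to Turnbull. It is not the algorithm used by either package mentioned above, it spreads mass over every cell between consecutive endpoints rather than only the minimal support, and the function name `npmle_interval_censored` and the three example intervals are hypothetical.

```python
import math

def npmle_interval_censored(intervals, n_iter=500):
    """Self-consistency (EM) estimate of the probability mass per cell.

    intervals: list of (L, R] pairs; use math.inf as R when right censored.
    Returns (cells, mass): cells are the half-open intervals between
    consecutive distinct endpoints, mass[k] is the estimated P(cell k).
    """
    points = sorted({x for pair in intervals for x in pair})
    cells = list(zip(points[:-1], points[1:]))
    # compat[i][k] is True when cell k lies inside observation interval i
    compat = [[lo <= a and b <= hi for (a, b) in cells]
              for (lo, hi) in intervals]
    n = len(intervals)
    mass = [1.0 / len(cells)] * len(cells)        # start from uniform mass
    for _ in range(n_iter):
        new = [0.0] * len(cells)
        for row in compat:
            denom = sum(m for m, ok in zip(mass, row) if ok)
            for k, ok in enumerate(row):
                if ok:
                    # share of observation i attributed to cell k
                    new[k] += mass[k] / denom / n
        mass = new
    return cells, mass

cells, mass = npmle_interval_censored([(0, 2), (1, 3), (2, math.inf)])
# The survival estimate just after each cell is 1 minus the cumulative mass.
```

The estimate is only defined up to where the mass sits inside each cell, which is why NPMLE survival curves for interval-censored data are often drawn with shaded rectangles instead of single steps.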