Survival Analysis – Understanding Censoring, Truncated, and Missing Data

censoring, missing data, survival, truncation

Can someone please explain the difference between censored, truncated and missing data in survival analysis? Suppose I have the following information.

   user_ID    survival_time(y) age(x1) tumor_size(x2)
      1        10               2         5
      2        3                3         4
      3        _                _         3
      4        _                _         _

I am concerned about users 3 and 4. Are those subjects considered censored, truncated or missing? If I exclude them from the study, does that lead to biased results when modeling survival time?

Best Answer

It's probably best to think about censored, truncated and missing data in terms of data values rather than in terms of individuals, as you seem to be doing.

Censored data are observations for which you only know some range within which the value lies. In survival analysis, that's typically when you've followed up an individual for a period of time but the event (e.g., death) hasn't yet occurred. Then the follow-up time is a lower limit to the survival_time, a right-censored observation.

Truncated data represent values that couldn't have been observed, overall or for a particular individual. Your analysis then must take into account that you have no information about observations that might fall into that range. One example is if you are evaluating survival since diagnosis of a disease, but someone comes to your institution to enter your study some period of time after being diagnosed elsewhere. If someone like that person had died between diagnosis and coming to your institution, you would have had no information about survival. So the initial survival time when the individual comes to enter the study should be treated as left truncated.

Missing data are simply that: observations that weren't made or recorded. Depending on the circumstances, that might have been due to truncation, so you need to evaluate the nature of the data collection.

The way that you have presented your data, all of your "_" values seem to be missing. Your survival_time values are a possible exception, depending on whether you have additional information to fill in the blanks with some extra annotation. For example, if you had followed users 3 and 4 for a period of time but they hadn't yet died, you could use the follow-up time as a censored observation of survival_time. You know that they survived at least that long, so you have a range of possibilities. You would then need to code the data with an additional variable to indicate whether a particular survival_time represents a death or right censoring.
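As a rough sketch of that coding in Python: each record carries a time plus an event indicator. The follow-up time for user 3 is invented for illustration, and I am assuming that the times shown for users 1 and 2 are observed deaths; the names `Record` and `classify` are hypothetical.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    user_id: int
    time: Optional[float]  # death time or last follow-up time; None if nothing was recorded
    event: Optional[int]   # 1 = death observed, 0 = right-censored, None if time is missing


records = [
    Record(1, 10.0, 1),     # assumed: death observed at time 10
    Record(2, 3.0, 1),      # assumed: death observed at time 3
    Record(3, 7.0, 0),      # hypothetical: still alive at last follow-up, time 7
    Record(4, None, None),  # no follow-up information at all: truly missing
]


def classify(r: Record) -> str:
    """Label a survival_time value as an event, right-censored, or missing."""
    if r.time is None:
        return "missing"
    return "event" if r.event == 1 else "right-censored"


labels = {r.user_id: classify(r) for r in records}
print(labels)
```

With this layout, user 3 still contributes information (survival beyond time 7), while user 4's survival_time stays missing and would need a missing-data method rather than a censoring indicator.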

A classic text on censoring and truncation in survival analysis is Klein and Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data. It includes many examples of different types of censored and truncated survival times, along with ways to handle them. With truly missing data, simply omitting individuals with missing data is generally not the best way to proceed. Stef van Buuren's book, Flexible Imputation of Missing Data, shows how to deal with missing data in principled ways.

Related Solutions

Kaplan Meier Interpretation – How to Interpret Kaplan Meier with Truncated and Right Censored Data

I'll give an explanation that is very close to that of Maarten Buis but just a little more elaborate. As always in survival analysis, different time scales can be applied. I think that age is maybe the more intuitive time scale in your setting, so that's where I'll start my answer. Afterwards, I'll try to use that intuition to answer the question.

Let $C_i$ be time of birth. From your data we can easily calculate ages of entering the study,

$$ A_i = t_0 - C_i $$

and age of exiting the study,

$$ B_i = \min\{T - C_i, D_i\}, $$

where $D_i$ is age at death. Now note that we have an age interval, $(A_i, B_i]$, during which the $i$'th subject is under observation. On this time scale, the study subjects do not enter the study at the same time. Let's denote the minimum of the ages at entering the study by

$$ \alpha = \min_i A_i. $$

What survival information do we have before time $\alpha$? None. This is why we can't say anything about the probability of surviving the age interval $(0, \alpha]$. Necessarily, our Kaplan-Meier estimate must be conditional on survival until age $\alpha$. To give an example: Let's say that $\alpha$ is $1$ year. Would we be able to calculate the survivor function at time $5$ years, $S(5) = P(D > 5)$? Could we calculate how many children would live to see their fifth birthday? No, because we simply don't know how dangerous the first year is. We can calculate only the conditional survivor function $P(D > 5|D > 1)$. Actually, this can again be explained by a change in time scale: there is nothing special about 0. Your Kaplan-Meier estimate doesn't have to start at time zero; it can start at some other time, which corresponds to, e.g., the time scale defined by age minus $\alpha$. In your data, you write that $\alpha$ is very small as some children are included very young; thus, for $s > \alpha$,

$$ P(D > s | D > \alpha) = S(s)/S(\alpha) \simeq S(s) $$

and actually there is equality in the limit $\alpha \rightarrow 0$ if we assume $S$ to be continuous.
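To make the conditioning concrete, here is a minimal Python sketch of a product-limit (Kaplan-Meier) estimate with delayed entry on the age time scale. The ages are made up, the function name `km_delayed_entry` is hypothetical, and the sketch assumes every subject has entry age strictly below exit age.

```python
def km_delayed_entry(entry_age, exit_age, event):
    """Kaplan-Meier estimate with left truncation (delayed entry).

    A subject counts as at risk at age t only if entry_age < t <= exit_age.
    Returns (t, estimate) pairs at each distinct event age; the estimate is
    conditional on surviving to the smallest entry age.
    """
    event_ages = sorted({b for b, d in zip(exit_age, event) if d == 1})
    surv = 1.0
    curve = []
    for t in event_ages:
        at_risk = sum(1 for a, b in zip(entry_age, exit_age) if a < t <= b)
        deaths = sum(1 for b, d in zip(exit_age, event) if b == t and d == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical ages in years: A_i (entry), B_i (exit), and death indicator.
entry_age = [1, 1, 2, 5]
exit_age = [4, 9, 6, 8]
event = [1, 0, 1, 1]

curve = km_delayed_entry(entry_age, exit_age, event)
for t, s in curve:
    print(t, round(s, 3))
```

Here $\alpha = 1$, so the curve estimates $P(D > s | D > 1)$. The fourth subject enters at age 5 and therefore is not in the risk set for the death at age 4, which is how the left truncation enters the calculation.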

Let's change back to your original time scale, plain calendar time. You have no idea how dangerous the time before $t_0$ is, therefore your estimate must be conditional on surviving until $t_0$. This stems from the fact that no children are observed before time $t_0$. On this time scale, it doesn't make much of a difference how close the times of birth are to $t_0$ as we have assumed the same hazard for all ages (instead of an age-specific hazard as above). To sum up, on this time scale (using calendar time), the interpretation of the Kaplan-Meier estimate would be that of (for $t \in (t_0, T]$),

$$ P(X > t | X > t_0). $$

This is not as intuitive as on the age time scale; however, it just means that when doing a study in calendar time, we condition on the subjects having survived the time from birth until the start of our study.

To answer the last part of the question, you do not condition on $T_{first}$ nor on $L_1$, you condition on survival until $t_0$ as this is the minimum of entering times. I think part of the confusion is due to the fact that all the children enter your study at the same time, which is not necessarily the case in all applications, as is evident from using age as the time scale above.

Finally, you could easily say that non-truncation corresponds to the truncation time being smaller than or equal to 0 (or some other natural starting point on a time scale).

Survival Analysis – Better Understanding of Censoring and Truncation

I think that the apparent discrepancy between the text and your professor has to do with the truncated (and thus missing) "observations" versus the implications for how you handle the data that you do have.

Yes, a truncated "observation" is one that is unavailable to the study because its value is out of range. But you don't have those truncated "observations" to work with at all. What you have is a sample of non-truncated observations. Wikipedia puts it nicely:

A truncated sample can be thought of as being equivalent to an underlying sample with all values outside the bounds entirely omitted, with not even a count of those omitted being kept.

Your professor's emphasis is on how you analyze the data that you do have. From that perspective in the example case, you have to treat the age values of the observations you have as left-truncated, as the sample provides no information about ages below the threshold. That applies to the entire sample at hand.
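A small Python sketch may help separate the two ideas. The ages and the threshold of 60 are made up for illustration; in a real truncated study you would never see the omitted values or know how many there were.

```python
# Hypothetical underlying ages in the population of interest.
underlying_ages = [45, 52, 58, 61, 67, 73, 80]
threshold = 60

# Left truncation: values at or below the threshold are omitted entirely,
# and not even a count of the omitted values is kept.
truncated_sample = [a for a in underlying_ages if a > threshold]

# Left censoring, for contrast: every individual stays in the sample, but
# values at or below the threshold are known only as "<= threshold".
censored_sample = [
    (a, "exact") if a > threshold else (threshold, "<=")
    for a in underlying_ages
]

print(truncated_sample)
print(censored_sample)
```

The truncated sample has 4 values and no trace of the other 3, while the censored sample still has all 7 entries, 3 of them recorded only as a bound.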

Section 5.3 of my edition of your text explains a standard situation that leads to right truncation: when you enroll participants into a study only after they have developed some disease. In that case, their times from some initiating cause (like an initial infection) to the event of developing overt disease provide no information about individuals who might have a longer time between initiating cause and overt disease. The example there is for individuals who developed AIDS following blood transfusions.

A mixed situation: time-varying covariates

In the provided example of a cutoff of > 60 years of age to be entered into a study, the entire sample needs to be treated as left-truncated if time = 0 is treated as date of birth. In other circumstances you need to evaluate each observation in your sample with respect to whether it needs to be treated as truncated or censored.

This can happen with time-varying covariates handled via a counting process, as in the R coxph() function. Say that an individual starts at time = 0 with a categorical covariate having level A and stays event-free until time = 5, at which time that covariate changes to level B, remains event-free until time = 7 when the covariate changes to level C, and has the event at time = 9 with covariate level still C. The data for that individual might be coded as follows:

start  stop  covariate  event
0      5     A          0
5      7     B          0
7      9     C          1

Now you have to think about truncation and censoring for each time period for the individual. For the first time period you have simple right censoring at time = 5, as you are starting from the reference time = 0 and there is no event. It contains information with respect to a covariate value of A for the entire period starting at the time = 0 reference through time = 5, so there is no left truncation. The second time period is also right censored, here at time = 7, as there is no event then; it is treated as left truncated, as the data provide no information about a covariate level of B prior to time = 5. The third period similarly provides no information about a covariate level of C prior to time = 7 so it is left truncated, with the (uncensored) event at time = 9.
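As an illustration only, this Python sketch builds the (start, stop] rows from a covariate history and labels each row the way the paragraph above does. The helper names `to_counting_process` and `describe` are hypothetical; this is not how coxph() works internally.

```python
def to_counting_process(changes, end_time, event):
    """Split one individual's follow-up into (start, stop, covariate, event) rows.

    changes: list of (time, level) pairs sorted by time, the first at the
    reference time. Only the last row can carry the event indicator.
    """
    rows = []
    for i, (start, level) in enumerate(changes):
        is_last = i == len(changes) - 1
        stop = end_time if is_last else changes[i + 1][0]
        rows.append((start, stop, level, event if is_last else 0))
    return rows


def describe(row, reference_time=0):
    """Describe the left truncation and censoring status of one row."""
    start, stop, _, ev = row
    left = "no left truncation" if start == reference_time else f"left-truncated at {start}"
    right = f"event at {stop}" if ev == 1 else f"right-censored at {stop}"
    return f"{left}, {right}"


rows = to_counting_process([(0, "A"), (5, "B"), (7, "C")], end_time=9, event=1)
for row in rows:
    print(row, "->", describe(row))
```

The three rows reproduce the table above: the first is only right-censored, the second is left-truncated at 5 and right-censored at 7, and the third is left-truncated at 7 with the event at 9.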

So I suppose it's best to think about truncation on a data-value by data-value basis: about what time periods do the data provide information? In some circumstances all the data values need to be considered truncated, as in the left-truncation example of the retirement home or the right-truncation when only those with the event are included in the sample. But in other situations you need to proceed more cautiously.
