Survival Analysis – Better Understanding of Censoring and Truncation


I am taking a course in survival analysis where we follow the book "Survival Analysis: Techniques for Censored and Truncated Data" by John P. Klein and Melvin L. Moeschberger.

Although I like the book in general, I don't really understand the way it explains truncation, and my professor's explanation hasn't helped much either. Worse, the two explanations seem to conflict with each other.


From my current understanding, censoring in survival analysis occurs in one of two ways: either the event whose time we are modelling has already occurred for an observation before it is observed in the study (left-censoring), or we do not see the event occur for an observation during the study (right-censoring). Hence (1) censoring is a local property, in the sense that we talk about an individual observation being censored, and (2) censored observations are still in the study.

From my current understanding, truncation on the other hand occurs when the study is conducted in such a way that some potential observations are left out.

So the main difference between censoring and truncation is whether we have access to all potential observations, is it not?

Now here is where the book and my professor's explanations differ a bit:

  1. Book: A truncated observation is one that has been left out due to some restriction of the study. For instance, if we are modelling time to death but only look at people in retirement homes, we miss everyone who died before reaching the minimum age for admission (say 60). Hence people who died younger than 60 are truncated in this study.

  2. Professor: Truncation occurs when observations are only in the study given some condition. In the same example ALL observations are truncated since they have survived beyond age 60 and are only in the study because of this condition.

So from what I understand the book claims truncation is a local property, while my professor claims that it is a global property (in the sense that the data set is truncated). Which one is the correct one? If there are different conventions, then which one is the most common one?

Also, the above example is an example of left truncation. I am having a hard time coming up with a natural example of right truncation. Does anyone have a good one?

Best Answer

I think that the apparent discrepancy between the text and your professor comes down to a difference in focus: the text is describing the truncated (and thus missing) "observations", while your professor is describing the implications for how you handle the data that you do have.

Yes, a truncated "observation" is one that is unavailable to the study because its value is out of range. But you don't have those truncated "observations" to work with at all. What you have is a sample of non-truncated observations. Wikipedia puts it nicely:

"A truncated sample can be thought of as being equivalent to an underlying sample with all values outside the bounds entirely omitted, with not even a count of those omitted being kept."

Your professor's emphasis is on how you analyze the data that you do have. From that perspective in the example case, you have to treat the age values of the observations you have as left-truncated, as the sample provides no information about ages below the threshold. That applies to the entire sample at hand.
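To make that concrete, here is a minimal simulation sketch of the retirement-home example. The age-at-death distribution, the sample size, and the names ENTRY_AGE, population and sample are all assumptions made up for illustration; nothing here comes from the book. It only shows that the sample you end up with contains no ages below the threshold, and that a naive summary of it is shifted upward.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

ENTRY_AGE = 60  # assumed admission threshold from the retirement-home example

# Hypothetical population of ages at death (illustrative distribution only).
population = [random.gauss(75, 12) for _ in range(10_000)]

# Left truncation: anyone who died before the entry age never enters the
# study, and the study keeps no count of how many such people there were.
sample = [age for age in population if age > ENTRY_AGE]

mean_population = sum(population) / len(population)
mean_sample = sum(sample) / len(sample)

print(f"population mean age at death: {mean_population:.1f}")
print(f"truncated-sample mean:        {mean_sample:.1f}")
print(f"smallest age in the sample:   {min(sample):.1f}")
```

Every value in sample exists only because it exceeded ENTRY_AGE, which is the sense in which the whole data set has to be analyzed as left-truncated.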

Section 5.3 of my edition of your text explains a standard situation that leads to right truncation: when you enroll participants into a study only after they have developed some disease. In that case, their times from some initiating cause (like an initial infection) to the event of developing overt disease provide no information about individuals who might have a longer time between initiating cause and overt disease. The example there is for individuals who developed AIDS following blood transfusions.
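The same kind of sketch works for right truncation. The numbers below (an 8-year window, exponential incubation times with mean 5 years) are invented for illustration and are not the book's AIDS data; STUDY_END, cohort and observed are hypothetical names. A person enters the sample only if overt disease appears before data collection ends.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

STUDY_END = 8.0  # assumed: years from the start of the window to the end of data collection

# Hypothetical cohort: (infection time, incubation time until overt disease).
cohort = [
    (random.uniform(0, STUDY_END), random.expovariate(1 / 5.0))
    for _ in range(10_000)
]

# Right truncation: a person is sampled only if disease appears before
# STUDY_END, i.e. only if incubation <= STUDY_END - infection time.
observed = [(inf, inc) for inf, inc in cohort if inf + inc <= STUDY_END]

mean_all = sum(inc for _, inc in cohort) / len(cohort)
mean_observed = sum(inc for _, inc in observed) / len(observed)

print(f"mean incubation, full cohort:   {mean_all:.2f}")
print(f"mean incubation, sampled cases: {mean_observed:.2f}")
```

The long incubation times are systematically missing from observed, and each sampled person has their own truncation bound, STUDY_END minus their infection time.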

A mixed situation: time-varying covariates

In the provided example, where only people older than 60 can enter the study, the entire sample needs to be treated as left-truncated if time = 0 is taken to be the date of birth. In other circumstances you need to evaluate each observation in your sample to decide whether it needs to be treated as truncated or censored.

This can happen with time-varying covariates handled via a counting process, as in the R coxph() function. Say that an individual starts at time = 0 with a categorical covariate at level A and stays event-free until time = 5, when the covariate changes to level B. The individual then remains event-free until time = 7, when the covariate changes to level C, and has the event at time = 9 with the covariate still at level C. The data for that individual might be coded as follows:

start   stop covariate event
 0       5    A          0
 5       7    B          0
 7       9    C          1

Now you have to think about truncation and censoring for each time period for the individual. For the first time period you have simple right censoring at time = 5, as you are starting from the reference time = 0 and there is no event. It contains information with respect to a covariate value of A for the entire period starting at the time = 0 reference through time = 5, so there is no left truncation. The second time period is also right censored, here at time = 7, as there is no event then; it is treated as left truncated, as the data provide no information about a covariate level of B prior to time = 5. The third period similarly provides no information about a covariate level of C prior to time = 7 so it is left truncated, with the (uncensored) event at time = 9.
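The row-by-row reasoning above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not how coxph() is implemented; Interval, at_risk and describe are hypothetical helper names, and the sketch assumes the usual counting-process convention that a row covers the interval (start, stop].

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    stop: float
    covariate: str
    event: int


# The three rows for the single individual in the table above.
rows = [
    Interval(0, 5, "A", 0),
    Interval(5, 7, "B", 0),
    Interval(7, 9, "C", 1),
]


def at_risk(rows, t):
    """Rows that contribute to the risk set at time t, using (start, stop]."""
    return [r for r in rows if r.start < t <= r.stop]


def describe(row):
    """Summarize the truncation and censoring status of one row."""
    if row.start > 0:
        left = "left-truncated at %g" % row.start
    else:
        left = "no left truncation"
    if row.event:
        right = "event at %g" % row.stop
    else:
        right = "right-censored at %g" % row.stop
    return f"{row.covariate}: {left}, {right}"


for row in rows:
    print(describe(row))

# Which covariate level represents the individual at a few time points?
for t in (3, 6, 9):
    print(t, [r.covariate for r in at_risk(rows, t)])
```

At any time in (0, 9] exactly one row is at risk, so the individual is never counted twice even though the data hold three rows for them.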

So I suppose it's best to think about truncation on a data-value by data-value basis, asking for each one: over what time period does this value provide information? In some circumstances all the data values need to be considered truncated, as in the left-truncation example of the retirement home or the right-truncation example where only those with the event are included in the sample. But in other situations you need to proceed more cautiously.
