Survival Analysis – Understanding Censoring, Truncated, and Missing Data

censoring, missing data, survival, truncation

Can someone please explain the difference between censored, truncated and missing data in survival analysis? Suppose I have the following information.

   user_ID    survival_time(y) age(x1) tumor_size(x2)
      1        10               2         5
      2        3                3         4
      3        _                _         3
      4        _                _         _

I am concerned about users 3 and 4. Are those subjects considered censored, truncated or missing? If I exclude them from the study, does that lead to biased results when modeling survival time?

Best Answer

It's probably best to think about censored, truncated and missing data in terms of data values rather than in terms of individuals, as you seem to be doing.

Censored data are observations for which you only know some range within which the value lies. In survival analysis, that's typically when you've followed up an individual for a period of time but the event (e.g., death) hasn't yet occurred. The follow-up time is then a lower bound on the survival_time: a right-censored observation.
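One way to make this concrete is to look at how each kind of observation enters a likelihood. The notation below is an assumption for illustration and does not come from the question: $t_i$ is the recorded time, $\delta_i$ is 1 for an observed death and 0 for right censoring, $f$ is the density and $S$ the survival function of the survival time, and censoring is assumed independent of the survival time.

```latex
L(\theta) \;=\; \prod_{i=1}^{n} f(t_i;\theta)^{\delta_i}\, S(t_i;\theta)^{1-\delta_i}
```

An observed death contributes the density at $t_i$, while a right-censored observation contributes only $S(t_i;\theta)$, the probability of surviving beyond the follow-up time. That is how a censored value still carries information even though the exact survival time is unknown.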

Truncated data represent values that couldn't have been observed, overall or for a particular individual. Your analysis must then take into account that you have no information about observations that might fall into that range. One example is if you are evaluating survival since diagnosis of a disease, but someone enters your study at your institution some time after being diagnosed elsewhere. Had that person died between diagnosis and arrival at your institution, they would never have entered your study, and you would have had no information about their survival. So that individual's survival time should be treated as left truncated at the time since diagnosis at which they entered the study.
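In practice, left truncation is often handled by counting an individual as "at risk" only from the time they entered the study. The following is a minimal sketch of that idea, not a full analysis; the records, the tuple layout and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical records: (entry_time, exit_time, event), all measured from diagnosis.
# entry_time > 0 means the person joined the study some time after diagnosis,
# so their survival time is left truncated at entry_time.
# event is 1 for an observed death and 0 for right censoring.
records = [
    (0.0, 10.0, 1),
    (0.0, 3.0, 1),
    (4.0, 12.0, 0),  # entered 4 time units after diagnosis
    (2.0, 6.0, 1),   # entered 2 time units after diagnosis
]


def at_risk(records, t):
    """Count individuals under observation just before time t.

    With delayed entry, a person is in the risk set at t only if they
    have already entered the study and have not yet left it.
    """
    return sum(1 for entry, exit_, _ in records if entry < t <= exit_)


def at_risk_ignoring_entry(records, t):
    """Naive count that wrongly treats everyone as followed from diagnosis."""
    return sum(1 for _, exit_, _ in records if t <= exit_)


adjusted = at_risk(records, 3.0)
naive = at_risk_ignoring_entry(records, 3.0)
print(adjusted, naive)
```

The naive count includes the person who entered at time 4 in the risk set at time 3, even though a death before time 4 would have kept them out of the study altogether. Ignoring entry times in this way inflates the risk set at early times and tends to make early survival look better than it was.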

Missing data are simply that: observations that weren't made or recorded. Depending on the circumstances, that might have been due to truncation, so you need to evaluate the nature of the data collection.

The way that you have presented your data, all of your "_" values seem to be missing. Your survival_time values are a possible exception, depending on whether you have additional information that would let you fill in those blanks. For example, if you had followed users 3 and 4 for a period of time but they hadn't yet died, you could use the follow-up time as a censored observation of survival_time. You know that they survived at least that long, so you have a range of possibilities. You would then need to code the data with an additional variable to indicate whether a particular survival_time represents a death or right censoring.
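A minimal sketch of that coding, using only the Python standard library, might look like the following. The follow-up times for users 3 and 4 (8 and 5) are invented purely for illustration, as are the `event` field and the `kaplan_meier` helper; the question gives no such information.

```python
from collections import Counter

# event = 1: survival_time is an observed death.
# event = 0: survival_time is a follow-up time (right censored).
# The times for users 3 and 4 are ASSUMED follow-up times, not data from the question.
data = [
    {"user_id": 1, "time": 10, "event": 1},
    {"user_id": 2, "time": 3, "event": 1},
    {"user_id": 3, "time": 8, "event": 0},
    {"user_id": 4, "time": 5, "event": 0},
]


def kaplan_meier(data):
    """Return [(time, estimated survival probability)] at each death time."""
    deaths = Counter(r["time"] for r in data if r["event"] == 1)
    surv = 1.0
    curve = []
    for t in sorted(deaths):
        # Everyone whose recorded time is at least t is still at risk at t,
        # whether they later die or are censored.
        n_at_risk = sum(1 for r in data if r["time"] >= t)
        surv *= 1.0 - deaths[t] / n_at_risk
        curve.append((t, surv))
    return curve


km = kaplan_meier(data)
print(km)
```

Users 3 and 4 never appear as deaths, but they are counted in the risk set at time 3, so the censored observations still contribute to the estimate instead of being thrown away.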

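A classic text on censoring and truncation in survival analysis is Klein and Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data. It includes many examples of different types of censored and truncated survival times, along with ways to handle them. With truly missing data, simply omitting individuals with missing values is generally not the best way to proceed: it discards information and can bias the results unless the values are missing completely at random. Stef van Buuren's book, Flexible Imputation of Missing Data, shows how to deal with missing data in principled ways.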