Survival Analysis – Layman’s Explanation of Censoring in Survival Analysis

censoring, cox-model, survival

I have read about what censoring is and how it needs to be accounted for in survival analysis, but I would like a less mathematical, more intuitive definition (pictures would be great!). Can anyone explain 1) censoring and 2) how it affects things like Kaplan-Meier curves and Cox regression?

Best Answer

Censoring is often described in comparison with truncation. A nice description of the two processes is provided by Gelman et al. (2005, p. 235):

Truncated data differs from censored data in that no count of observations beyond the truncation point is available. With censoring the values of observations beyond the truncation point are lost, but their number is observed.

Censoring or truncation can occur for values above some level (right-censoring), below some level (left-censoring), or both.

Below is an example of a standard normal distribution that is censored at $2.0$ (middle) or truncated at $2.0$ (right). If a sample is truncated, we have no data beyond the truncation point. In a censored sample, values above the cutoff are "rounded" to the boundary value, so the boundary value is over-represented in the sample.

[Figure: sample from a standard normal distribution, censored at 2.0 (middle panel) and truncated at 2.0 (right panel)]
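The difference between the two operations can be sketched in a few lines of Python. This is only an illustration, not the code behind the plot: the helper name `censor_and_truncate`, the sample size, and the seed are assumptions made for the example.

```python
import random

def censor_and_truncate(values, cutoff):
    """Return (censored, truncated) versions of `values` at an upper cutoff."""
    # Censoring: values above the cutoff are kept, but recorded as the cutoff.
    censored = [min(v, cutoff) for v in values]
    # Truncation: values above the cutoff are dropped entirely.
    truncated = [v for v in values if v <= cutoff]
    return censored, truncated

rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # standard normal draws
censored, truncated = censor_and_truncate(sample, cutoff=2.0)

n_above = sum(v > 2.0 for v in sample)
# The censored sample keeps every observation; the truncated one is shorter.
print("sizes:", len(sample), len(censored), len(truncated))
# All values above the cutoff pile up exactly at the boundary after censoring.
print("above cutoff:", n_above, "recorded at boundary:", censored.count(2.0))
```

The censored list has the same length as the original, with the values above the cutoff stacked at $2.0$, while the truncated list simply loses them and keeps no record of how many were dropped.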

An intuitive example of censoring: you ask respondents their age but record it only up to some value, so that all ages above it, say 60 years, are recorded as "60+". You then have precise information for the non-censored values, while for the censored values you know only that they lie above the boundary.

A less typical, real-life example of censoring was observed in Polish matura exam scores, which attracted a lot of attention on the internet. The exam is taken at the end of high school, and students must pass it to be able to apply for higher education. Can you guess from the plot below the minimum number of points that students need to pass the exam? Not surprisingly, the "gap" in the otherwise normal distribution can easily be "filled in" by taking an appropriate fraction of the over-represented scores just above the censoring boundary.

[Figure: distribution of Polish matura exam scores]

In the case of survival analysis,

censoring occurs when we have some information about individual survival time, but we don’t know the survival time exactly

(Kleinbaum and Klein, 2005, p. 5). For example, you treat patients with some drug and observe them until the end of your study, but you have no knowledge of what happens to them after the study finishes (were there any relapses or side effects?). The only thing you know is that they "survived" at least until the end of the study.
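In data terms, each patient in such a study is usually stored as a pair: the time they were observed for, and a flag saying whether the event was actually seen. A minimal sketch, assuming a study that ends at time 10; the function name `observe` and the event times are made up for illustration:

```python
def observe(true_time, study_end):
    """Return (observed_time, event_observed) for one patient."""
    if true_time <= study_end:
        # The event happened while we were still watching: exact time is known.
        return true_time, True
    # The study ended first: we only know the patient "survived" until then.
    return study_end, False

# Hypothetical true event times; in a real study the last three decimals of
# information for censored patients would never be available to the analyst.
true_times = [2.5, 7.0, 11.3, 4.1, 15.8]
observed = [observe(t, study_end=10.0) for t in true_times]
print(observed)
# [(2.5, True), (7.0, True), (10.0, False), (4.1, True), (10.0, False)]
```

The two patients with `False` flags are the censored ones: they still contribute the information "no event up to time 10", which is exactly what survival models make use of.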

Below is an example of data generated from a Weibull distribution and modeled with the Kaplan–Meier estimator. The blue curve marks the model estimated on the full dataset. The middle plot shows the censored sample and the model estimated on the censored data (red curve); the right plot shows the truncated sample and the model estimated on it (red curve). As you can see, missing data (truncation) has a large impact on the estimates, but censoring can easily be handled by standard survival analysis models.

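[Figure: Kaplan–Meier estimates for Weibull-distributed data: full sample (blue curve), censored sample (middle panel, red curve), truncated sample (right panel, red curve)]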
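A comparison along these lines can be reproduced with a small hand-written Kaplan–Meier estimator. This is a sketch under stated assumptions, not the code used for the plot: the Weibull parameters (scale 10, shape 1.5), the fixed cutoff of 12, the sample size, and the helper names `kaplan_meier` and `survival_at` are all chosen for illustration.

```python
import random

def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate: list of (t, S(t)) at event times."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that leave the risk set at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]  # True counts as 1, False (censored) as 0
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Censored subjects leave the risk set without lowering the curve.
        n_at_risk -= removed
    return curve

def survival_at(curve, t):
    """Step-function lookup of the estimated S(t)."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s

rng = random.Random(1)  # fixed seed, arbitrary
true_times = [rng.weibullvariate(10.0, 1.5) for _ in range(2000)]
cutoff = 12.0  # assumed end of observation

# Full data: every event time is known.
full = kaplan_meier(true_times, [True] * len(true_times))
# Censored data: times past the cutoff are recorded as "at least cutoff".
censored = kaplan_meier([min(t, cutoff) for t in true_times],
                        [t <= cutoff for t in true_times])
# Truncated data: subjects past the cutoff are missing altogether.
kept = [t for t in true_times if t <= cutoff]
truncated = kaplan_meier(kept, [True] * len(kept))

for t in (4.0, 8.0, 11.0):
    print(t, round(survival_at(full, t), 3),
          round(survival_at(censored, t), 3),
          round(survival_at(truncated, t), 3))
```

With this simple fixed-cutoff censoring, the censored estimate matches the full-data estimate at every time before the cutoff, because censored subjects stay in the risk set until they leave observation. The truncated estimate is too low, since the long survivors were never counted in the denominator.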

This does not mean that you cannot analyze truncated samples, but in such cases you have to use models for missing data that try to "guess" the unknown information.


Kleinbaum, D.G. and Klein, M. (2005). Survival Analysis: A Self-Learning Text. Springer.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (2005). Bayesian Data Analysis. Chapman & Hall/CRC.
