Survival Models – How Survival Models Account for Censoring

censoringcox-modelsurvival

Background

I'm teaching an intro stats class in our social / health sciences department and I'm finding myself tripped up on something I'd always taken for granted: namely, the claim that survival analysis methods (from Kaplan-Meier to Cox models) "account for" censored data, and that this is one of these methods' central advantages over other approaches.

(Relevant background: in my research I use, mostly clumsily, some applied statistical and quantitative methods, and I have a good grasp of and intuition for some of the basics. So I'm an okay generalist but I'm not a statistician nor even a statistics grad student by any means.)

The Problem

As I'm preparing some PowerPoint slides for these students, though, I'm realizing that while I have a pretty good grasp on what censoring is in survival methods, I have no idea how something like a Cox model "accounts" for it. I'd never thought about it, and instead assumed that this is just what happened. (It's possible that a professor walked us through this earlier in my education, but it's also possible that I used to struggle in 8am stats lectures.)

Censoring, as you all already know, is the state in which we have some information about individual survival time, but we don't know the survival time exactly (Kleinbaum & Klein, p. 5; I've always liked this plain English explanation). In my experience, sometimes when stats people are trying to sell someone on survival methods, they'll say things like "logistic regression doesn't account for censoring, but Cox regression does!"

There are some great posts on CV about censoring; this is maybe my favorite example. In it, user @Tim gives a terrific explanation of censoring:

Intuitive example of censoring is that you ask your respondents about their age, but record it only up to some value and all the ages above this value, say 60 years, are recorded as "60+". This leads to having precise information for non-censored values and no information about censored values.

I think this is brilliant (in fact I plan on borrowing liberally from it during the lecture, with credit given of course). But it doesn't get really get into how survival analyses actually deal with this, and whether it's a selling point for survival methods ("our methods can do this and yours do not") or just something that pops up when you try and ask survival-type questions ("how long will people live, on average, after being given treatment X?").

The Questions

  1. Is it true, strictly speaking, that something like a Kaplan-Meier estimator or a Cox proportional hazards model is "accounting" for censoring?

  2. If so, how is it doing that?

  3. If a survival model indeed accounts for censoring, is this a "feature" of survival methods over others, or a "bug", an inevitable artifact of the sorts of questions one uses survival methods to answer?

My guesses

Well, not guesses, maybe more like very unclear intuitions I'm not too confident about:

  1. Why is censoring a problem? I'm thinking that, if ignored, censoring is a huge potential source of bias in making survival estimates: if you don't know what happened after Mr. Smith dropped out of your study (i.e. was lost to follow-up, i.e. was censored) your estimates may be off in one direction or the other. Maybe he lived a long, long time — or maybe he died the next day. If this is happening to lots of people in the same way, your estimates may be really, really off.

  2. So maybe what survival models have found a way to do is keep in the analysis everyone who contributed survival time, regardless of whether we know about their outcome status, while other methods would simply drop all of those people's data as missing.

Am I way, way off here?

Best Answer

Censoring is built into survival models by incorporating it into the likelihood function underlying the analysis. The most common form of censoring occurs when we observe an item for a finite period of time $T$ and it does not fail in that time. Below I will show you how the censoring is built into the likelihood function and how this affects the Cox proportional hazards model.


Incorporating censored data into the likelihood function: As a common example, suppose we have items where the time-to-failure has a survival function $S$ and corresponding density function $f$, both of which area parameterised by some parameter $\theta$. If an item $i$ is observed to fail at time $0 \leqslant t_i \leqslant T$ then it is incorporated into the likelihood function using the density term:

$$f(t_i | \theta).$$

However, if an item $i$ is observed throughout the whole time $T$ and it does not fail then this is considered to be a "right-censored" data point (only known to fail at some time after $G$) and it is incorporated into the likelihoood function using the survival term:

$$S(T|\theta).$$

Suppose we have a survival model based on observation for a fixed period of length $T$, where the times-to-failure for each observation are IID conditional on some underlying parameters. Without further loss of generality, we will have $n$ observed failures at times $t_1,...,t_n$ (all within the interval $[0,T]$) and we will will have $m$ right-censored values that did not fail in the oberved time $T$. The overall likelihood function for this data is then given by:

$$L_\mathbf{t}(\theta) = \bigg( \prod_{i=1}^n f(t_i|\theta) \bigg) \times S(T|\theta)^m.$$

In this likelihood function you can see that the censoring of data is "built in" by the fact that right-censored values are incorporated through their survival function instead of the density function for the time-to-failure.


Extending to the Cox proportional hazards model: The Cox proportional hazards model still uses a likelihood function for the observed times-to-failure and survival times, but it now adds covariates to the data and uses an assumption of proportional hazards in how these manifest in the hazard function. This does not change the underlying method of how censored values are built into the likelihood function --- e.g., right-censored values still enter through their survival function instead of the density of the time-to-failure.


Extension to other kinds of censoring: The above shows the common case where we have right-censored observations with the same censoring time $T$. Of course, this is not the only kind of censorship that can occur. Another possibility is that we might observe items up to different end-times, in which case the right-censored values would occur with different observation periods $t_{n+1},...,t_{n+m}$. In this case the likelihood function would be generalised to:

$$L_\mathbf{t}(\theta) = \bigg( \prod_{i=1}^n f(t_i|\theta) \bigg) \times \bigg( \prod_{i=1}^m S(t_{n+i}|\theta) \bigg).$$

Another possibility (which is uncommon in survival analysis) is left-censorship, where we know that an item failed no later than some time $T_i$. Left-censored observations enter into the likelihood function through the cumulative distribution function $F$. If we extend our model to assume that we have $r$ left-censored observations with observation times $t_{n+m+i},...,t_{n+m+r}$ then the likelihood function would be further generalised to:

$$L_\mathbf{t}(\theta) = \bigg( \prod_{i=1}^n f(t_i|\theta) \bigg) \times \bigg( \prod_{i=1}^m S(t_{n+i}|\theta) \bigg) \times \bigg( \prod_{i=1}^r F(t_{n+m+i}|\theta) \bigg).$$

And of course, you can extend this event further to allow for more complicated kinds of censorship. In general, if a censored observation is known to fall in some set $\mathscr{A}$ then it should enter into the likelihood function through the probability term:

$$\mathbb{P}(t_i \in \mathscr{A}|\theta) = \int \limits_\mathscr{A} f(t|\theta) \ dt.$$

Related Question