Survival Analysis – Why Large Amounts of Censoring Are Problematic in Survival Studies

censoring, survival

I am trying to understand why large amounts of censoring (i.e., many patients being censored) are undesirable in survival analysis.

As a proof of concept, suppose there are 5 patients, all of whom enter the study at the same time:

  • patient1 has event at t1
  • patient2 has event at t2
  • patient3 drops out of the study at t3
  • patient4 has event at t4
  • and when the study is over at t5, patient 5 has not had the event
  • t5 > t4 > t3 > t2 > t1

(Semi-Parametric Approach) Here is my attempt to write down the model and partial likelihood for a Cox PH regression in this situation:

$$ h(t|X) = h_0(t) \exp(\beta^T X) $$
$$ L(\beta) = \prod_{i: \delta_i = 1} \frac{\exp(\beta^T X_i)}{\sum_{j: t_j \geq t_i} \exp(\beta^T X_j)} $$

(here the baseline hazard $h_0(t_i)$ cancels from numerator and denominator, so only the $\exp(\beta^T X)$ terms remain)

$$ L(\beta) = \frac{\exp(\beta^T X_1)}{\sum_{j: t_j \geq t_1} \exp(\beta^T X_j)} \times \frac{\exp(\beta^T X_2)}{\sum_{j: t_j \geq t_2} \exp(\beta^T X_j)} \times \frac{\exp(\beta^T X_4)}{\sum_{j: t_j \geq t_4} \exp(\beta^T X_j)} $$

$$ L(\beta) = \frac{\exp(\beta^T X_1)}{\exp(\beta^T X_1) + \exp(\beta^T X_2) + \exp(\beta^T X_3) + \exp(\beta^T X_4) + \exp(\beta^T X_5)} \times \frac{\exp(\beta^T X_2)}{\exp(\beta^T X_2) + \exp(\beta^T X_3) + \exp(\beta^T X_4) + \exp(\beta^T X_5)} \times \frac{\exp(\beta^T X_4)}{\exp(\beta^T X_4) + \exp(\beta^T X_5)} $$
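To check my understanding numerically, here is a small sketch of this partial likelihood for the five patients; the covariate values, event times, and $\beta$ below are made up purely for illustration:

import numpy as np

# Hypothetical data for the five patients (one covariate each)
X = np.array([0.5, 1.2, -0.3, 0.8, 0.0])
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # t1 < t2 < t3 < t4 < t5
delta = np.array([1, 1, 0, 1, 0])         # patient 3 drops out, patient 5 ends the study event-free
beta = 0.4                                 # hypothetical coefficient

log_partial_lik = 0.0
for i in range(len(t)):
    if delta[i] == 1:
        at_risk = t >= t[i]                # risk set: everyone not yet failed or censored
        log_partial_lik += beta * X[i] - np.log(np.sum(np.exp(beta * X[at_risk])))
print(log_partial_lik)

The three terms correspond to the events at t1, t2, and t4; the censored patients only ever appear in the denominators (risk sets).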

(Parametric Approach) Here is my attempt to write down the model and likelihood for an AFT model in this situation (note that the likelihood is based on the distribution of $\epsilon$ rather than the survival times $T$ directly. My understanding is that if we give $T$ a distribution such as the Exponential or Weibull, then $\epsilon$ follows an extreme value distribution such as the Gumbel distribution):

$$ \log(T) = \mu + \beta^T X + \sigma \epsilon $$

$$ L(\mu, \sigma, \beta) = \prod_{i=1}^{n} \left[ \frac{1}{\sigma t_i} f\left( \frac{\log(t_i) - \mu - \beta^T X_i}{\sigma} \right) \right]^{\delta_i} \left[ 1 - F\left( \frac{\log(t_i) - \mu - \beta^T X_i}{\sigma} \right) \right]^{1-\delta_i} $$

$$ L(\mu, \sigma, \beta) = \left[ \frac{1}{\sigma t_1} f\left( \frac{\log(t_1) - \mu - \beta^T X_1}{\sigma} \right) \right] \times \left[ \frac{1}{\sigma t_2} f\left( \frac{\log(t_2) - \mu - \beta^T X_2}{\sigma} \right) \right] \times \left[ 1 - F\left( \frac{\log(t_3) - \mu - \beta^T X_3}{\sigma} \right) \right] \times \left[ \frac{1}{\sigma t_4} f\left( \frac{\log(t_4) - \mu - \beta^T X_4}{\sigma} \right) \right] \times \left[ 1 - F\left( \frac{\log(t_5) - \mu - \beta^T X_5}{\sigma} \right) \right] $$
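Again as a sketch with made-up numbers, for the Weibull AFT case ($\epsilon$ standard smallest extreme value, with density $f(z) = e^{z - e^{z}}$ and survival $S(z) = e^{-e^{z}}$) the log-likelihood above can be evaluated directly:

import numpy as np

# Same hypothetical five patients as above
X = np.array([0.5, 1.2, -0.3, 0.8, 0.0])
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
delta = np.array([1, 1, 0, 1, 0])
mu, sigma, beta = 0.5, 0.8, 0.4            # hypothetical parameter values

z = (np.log(t) - mu - beta * X) / sigma    # standardized residuals
log_f = z - np.exp(z) - np.log(sigma * t)  # log density of t_i (includes the 1/(sigma*t_i) Jacobian)
log_S = -np.exp(z)                         # log survival, used for the censored contributions
log_lik = np.sum(delta * log_f + (1 - delta) * log_S)
print(log_lik)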

So, in the Cox PH and AFT models, how are inference and parameter estimation negatively affected when large numbers of patients are censored (e.g., higher variance, higher bias, loss of consistency, needing larger sample sizes to achieve results comparable to those from data with less censoring)? Does the mathematical optimization itself become difficult (e.g., incomplete matrix rank, undefined matrix inverses, a non-identifiable model)?

Best Answer

I'm going to give a slightly different answer from the previous responses, one with a more visual flavor. For simplicity, suppose we want to estimate the survival function up to 1 year. To show how censoring impacts the analysis, I am going to use nonparametric bounds.

The bounds are the best and worst cases for the survival function that are consistent with the observed data, placing no assumptions on the censoring mechanism. They represent the extremes of what censoring could be doing to our estimates. For the upper bound on the survival function, we assume that every censored individual never has the event (up to 1 year). For the lower bound, we assume that every censored individual has the event immediately after being censored.
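Written out, with observed time $T^*_i = \min(T_i, C_i)$ and event indicator $\delta_i$, the empirical versions of the bounds at a time $t$ before 1 year amount to (this is just a sketch; the notation is mine):

$$ \hat{S}_{\text{lower}}(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(T^*_i > t), \qquad \hat{S}_{\text{upper}}(t) = 1 - \frac{1}{n} \sum_{i=1}^{n} \delta_i \, \mathbb{1}(T^*_i \le t) $$

The lower bound treats every censoring time as an event time; the upper bound counts only observed events as failures. Any point estimate that relies on an assumption about censoring has to fall between these two curves.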

The following is some code to generate survival times for $n = 10000$ observations under different extents of censoring. In case 0 there is no censoring; cases 1 and 2 have progressively more censoring.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

n = 10000
t_max = 1
t = 0.7 * np.random.weibull(1.5, size=n)   # Event times
t = np.where(t >= t_max, t_max, t)         # Admin censoring at 1 year
c0 = 2                                     # Censoring (none)
c1 = 1.5 * np.random.weibull(2., size=n)   # Censoring (some)
c2 = 0.9 * np.random.weibull(0.9, size=n)  # Censoring (more)
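Before computing the bounds, it can help to check how much censoring each scenario actually produces (using the arrays just generated; the labels are mine):

# Fraction of observations whose event is not observed before the 1-year end of follow-up
for label, c in [("case 0", c0), ("case 1", c1), ("case 2", c2)]:
    event_observed = (t <= c) & (t < t_max)
    print(label, "fraction censored:", np.mean(~event_observed))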

For c0, there is no censoring, so the bounds coincide with the Kaplan-Meier point estimate. The following code computes that estimate:

t0_star = t                           # New variable for t
delta0 = np.where(t >= t_max, 0, 1)   # Event indicator (all events except the administrative censoring at t_max)

# Kaplan-Meier estimate; with no censoring it reduces to the empirical survival function
km0 = KaplanMeierFitter()
km0.fit(t0_star, delta0)
km0_St = km0.survival_function_  # Survival function

The following is code to get the bounds for censoring in case 1 (c1).

# Setting up the data we actually get to observe in this case
t1_star = np.minimum(t, c1)                # Observed time: min(event time, censoring time)
delta1 = np.where((t <= c1) & (t < t_max), 1, 0)

# Upper bound computation
t1_staru = np.where(delta1 == 0, t_max, t1_star)  # Censored assumed event-free up to t_max
km1u = KaplanMeierFitter()
km1u.fit(t1_staru, delta1)
km1u_St = km1u.survival_function_

# Lower bound computation
delta1l = np.where(t1_star < t_max, 1, delta1)  # Censored assumed to have the event at their censoring time
km1l = KaplanMeierFitter()
km1l.fit(t1_star, delta1l)
km1l_St = km1l.survival_function_

# Merging bounds into single data object for plotting
bounds = pd.merge(km1l_St, km1u_St, left_index=True, right_index=True, how='outer')
bounds.ffill(inplace=True)

# Plotting the region between the bounds (after the merge: KM_estimate_x = lower, KM_estimate_y = upper)
plt.fill_between(bounds.index, bounds.KM_estimate_x, bounds.KM_estimate_y, step='post')
plt.show()

When we apply this procedure to each of the scenarios, we get the following plot of the bounds:

[Figure: nonparametric bounds on the survival function, widening as censoring increases]
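For completeness, here is a minimal sketch of how the steps above could be wrapped up and repeated for c0, c1, and c2 to produce such a plot (the helper function and labels are mine):

def km_bounds(t, c, t_max):
    # Observed data for this censoring scenario
    t_star = np.minimum(t, c)
    delta = np.where((t <= c) & (t < t_max), 1, 0)
    # Upper bound: censored individuals never have the event before t_max
    km_u = KaplanMeierFitter().fit(np.where(delta == 0, t_max, t_star), delta)
    # Lower bound: censored individuals have the event at their censoring time
    km_l = KaplanMeierFitter().fit(t_star, np.where(t_star < t_max, 1, delta))
    return km_l.survival_function_, km_u.survival_function_

for label, c in [("no censoring", c0), ("some censoring", c1), ("more censoring", c2)]:
    lower, upper = km_bounds(t, c, t_max)
    b = pd.merge(lower, upper, left_index=True, right_index=True, how='outer').ffill()
    plt.fill_between(b.index, b.KM_estimate_x, b.KM_estimate_y, step='post', alpha=0.4, label=label)
plt.legend()
plt.show()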

So, as you can see, the bounds get wider as more censoring occurs. Large amounts of censoring are problematic in survival analysis because the data simply contain less information. To compute point estimates for the survival function (with Cox, AFT, Kaplan-Meier, etc. models), we have to rely on assumptions about how censoring occurs, and the more censoring there is, the more we have to 'lean' on those assumptions. That is why less censoring is generally better.
