Censored Data – Is Uncensored Data More Informative Compared to Censored Data

censoring, regression, survival

I am told that one of the main benefits of Survival Analysis models is their ability to handle Censored Data. This is in contrast to standard regression models, which are unable to do so.

For example, suppose researchers in a medical study are interested in knowing whether a certain drug is able to prolong the life of patients with a certain disease: in this medical study, a patient dying is considered the "event". Suppose one of the patients has to move to a new country 5 years after the study has started and we are no longer able to collect data on this patient. We know that the patient survived for at least 5 years. In a classical regression model, the data for this patient would be considered "incomplete" and we would not be able to use it in the model. However, a Survival Analysis model would let us use the "complete part of the incomplete data" belonging to this patient, thus allowing our model to profit from potentially useful information that would otherwise have been discarded by a standard regression model. In the context of Survival Analysis, this particular patient would be labelled as "Censored".

I am interested in the following (obvious) question: Suppose we have a dataset that contains no Censored Data – for the purpose of a simulation, we decide to randomly "censor" the data belonging to some of the patients. Can we somehow show that estimates from Survival Models (e.g. Kaplan-Meier, AFT, Cox PH) would be "better" on the same dataset when there is less censoring compared to more censoring? (e.g. one dataset has no censoring, one dataset has 5% random censoring, one dataset has 10% random censoring – we fit Survival Models to all 3 datasets and compare the quality of the estimates)

I am aware that Survival Models do not require there to be Censoring in the dataset, and I am also aware that higher levels of Censoring are considered undesirable for Survival Analysis models – but is there some mathematical proof that shows the "decline" in the estimates provided by Survival Analysis models when higher levels of Censoring are present?

Best Answer

We could rephrase your question as asking whether methods based on full data (i.e. noncensored data) are necessarily more efficient than methods based on observed data (i.e. censored data). This question can be answered in general by semiparametric efficiency theory.

Let $Z$ denote the full data (such as covariates and failure time). Suppose we have a data set of i.i.d. draws $Z_1, \dots, Z_n$. A full data estimator $\hat\beta$ for an estimand $\beta^*$ is asymptotically linear with influence function $\varphi^F$ if $$\sqrt{n} ( \hat\beta - \beta^*) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi^F(Z_i) + o_P(1).$$ Such an estimator has asymptotic variance $\mathrm{var}\left\{ \varphi^F(Z) \right\}$. Likewise, let $\mathcal{O}$ denote the observed data, that is, the full data $Z$ subject to coarsening or missingness. We can similarly define the influence function $\varphi$ for an observed data estimator.
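As a simple illustration of the definition (an added example, not part of the original argument), take a scalar $Z$ with finite variance, the estimand $\beta^* = \mathbb{E}[Z]$ and the sample mean $\hat\beta = n^{-1} \sum_{i=1}^n Z_i$. Then $$\sqrt{n} ( \hat\beta - \beta^*) = \frac{1}{\sqrt{n}} \sum_{i=1}^n (Z_i - \beta^*),$$ so the sample mean is asymptotically linear with influence function $\varphi^F(Z) = Z - \beta^*$, the remainder term is exactly zero, and the asymptotic variance is $\mathrm{var}\left\{ \varphi^F(Z) \right\} = \mathrm{var}(Z)$.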

This suggests that we can compare the efficiency of observed data estimators and full data estimators through comparisons of their influence functions. Rather than studying the influence function of a given estimator, we can study the class of influence functions of all regular estimators of the estimand $\beta^*$.

Lemma 7.4 in Tsiatis (2006) establishes the relationship between the class of influence functions of observed data estimators and the corresponding class for full data estimators. He shows that the class of observed data influence functions equals \begin{equation*} \frac{I(\mathcal{C}=\infty)}{\varpi(\infty, Z)} \varphi^F(Z) + L_2(\mathcal{O}), \end{equation*} where $\mathcal{C}=\infty$ denotes that the full data is observed (i.e. $T \leq C$ in survival analysis), $\varpi(\infty, Z) = \mathbb{P}[\mathcal{C}=\infty \mid Z]$ is the conditional probability of observing the full data, $L_2$ is an arbitrary function satisfying $\mathbb{E}[L_2(\mathcal{O})\mid Z] = 0$, and $\varphi^F$ is an arbitrary full data influence function.

Based on this identity, we can derive the asymptotic variance of an observed data asymptotically linear estimator with influence function $\varphi$ as \begin{align*} & \mathrm{var} \left\{ \varphi(\mathcal{O}) \right\} \\ =\, & \mathrm{var} \left[ \mathbb{E} \left\{ \varphi(\mathcal{O}) \mid Z \right\} \right] + \mathbb{E} \left[ \mathrm{var} \left\{ \varphi(\mathcal{O}) \mid Z \right\} \right] \\ =\, & \mathrm{var} \left[ \mathbb{E} \left\{ \frac{I(\mathcal{C}=\infty)}{\varpi(\infty, Z)} \varphi^F(Z) + L_2(\mathcal{O}) \mid Z \right\} \right] + \mathbb{E} \left[ \mathrm{var} \left\{ \varphi(\mathcal{O}) \mid Z \right\} \right] \\ =\, & \mathrm{var} \left[ \mathbb{E} \left\{ \frac{I(\mathcal{C}=\infty)}{\varpi(\infty, Z)} \varphi^F(Z) \mid Z \right\} \right] + \mathbb{E} \left[ \mathrm{var} \left\{ \varphi(\mathcal{O}) \mid Z \right\} \right] \\ =\, & \mathrm{var} \left[ \varphi^F(Z) \right] + \mathbb{E} \left[ \mathrm{var} \left\{ \varphi(\mathcal{O}) \mid Z \right\} \right] \\ \succcurlyeq\, & \mathrm{var} \left[ \varphi^F(Z) \right]. \end{align*} The first equality is the law of total variance. The third uses $\mathbb{E}[L_2(\mathcal{O})\mid Z] = 0$, and the fourth uses $\mathbb{E}[I(\mathcal{C}=\infty) \mid Z] = \varpi(\infty, Z)$. The final inequality holds because $\mathbb{E} \left[ \mathrm{var} \left\{ \varphi(\mathcal{O}) \mid Z \right\} \right]$ is nonnegative (positive semidefinite when $\beta^*$ is vector-valued).
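To make the size of the extra term concrete, here is a worked special case (an added illustration, assuming a scalar estimand and the particular choice $L_2 \equiv 0$, i.e. the inverse probability weighted complete-case influence function $\varphi(\mathcal{O}) = \frac{I(\mathcal{C}=\infty)}{\varpi(\infty, Z)} \varphi^F(Z)$). Conditional on $Z$, the indicator $I(\mathcal{C}=\infty)$ is Bernoulli with success probability $\varpi(\infty, Z)$, so \begin{align*} \mathrm{var} \left\{ \varphi(\mathcal{O}) \mid Z \right\} & = \frac{\varphi^F(Z)^2}{\varpi(\infty, Z)^2} \, \varpi(\infty, Z) \left\{ 1 - \varpi(\infty, Z) \right\} = \frac{1 - \varpi(\infty, Z)}{\varpi(\infty, Z)} \, \varphi^F(Z)^2, \\ \mathrm{var} \left\{ \varphi(\mathcal{O}) \right\} & = \mathrm{var} \left\{ \varphi^F(Z) \right\} + \mathbb{E} \left[ \frac{1 - \varpi(\infty, Z)}{\varpi(\infty, Z)} \, \varphi^F(Z)^2 \right]. \end{align*} In this special case the efficiency loss grows as the probability $\varpi(\infty, Z)$ of observing the full data shrinks, and it vanishes when $\varpi(\infty, Z) = 1$ wherever $\varphi^F(Z) \neq 0$.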

This shows that any observed data estimator has asymptotic variance at least as large as that of its corresponding full data estimator. Equality holds only when the second summand, $\mathbb{E} \left[ \mathrm{var} \left\{ \varphi(\mathcal{O}) \mid Z \right\} \right]$, is zero, which is the case when the observed data equal the full data. In a survival analysis setting, this shows that whenever censoring is present, the observed data estimators are less efficient than the full data estimators.
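The simulation described in the question can be sketched along the following lines. This is only a minimal illustration under assumptions that are not in the answer above: event times are exponential, censoring times are independent exponentials with a rate chosen to hit a target censoring fraction, and the estimator is the parametric exponential maximum likelihood estimate of the hazard rate (number of events divided by total follow-up time) rather than Kaplan-Meier, AFT or Cox PH. The function names, sample sizes and censoring fractions are all hypothetical choices.

```python
import numpy as np


def simulate_rate_estimates(censor_frac, n=200, n_reps=2000, rate=1.0, seed=0):
    """Return n_reps estimates of an exponential hazard rate under random censoring."""
    rng = np.random.default_rng(seed)
    # True event times T ~ Exponential(rate), one row per simulated study.
    event_times = rng.exponential(scale=1.0 / rate, size=(n_reps, n))
    if censor_frac > 0:
        # Independent censoring times C ~ Exponential(c), with c chosen so that
        # P(C < T) = c / (rate + c) equals the target censoring fraction.
        censor_rate = rate * censor_frac / (1.0 - censor_frac)
        censor_times = rng.exponential(scale=1.0 / censor_rate, size=(n_reps, n))
    else:
        # No censoring: every event time is fully observed.
        censor_times = np.full((n_reps, n), np.inf)
    observed = np.minimum(event_times, censor_times)
    is_event = event_times <= censor_times
    # Exponential MLE with right censoring: events / total follow-up time.
    return is_event.sum(axis=1) / observed.sum(axis=1)


def empirical_variances(fracs=(0.0, 0.05, 0.10, 0.30, 0.50)):
    """Empirical variance of the rate estimate for each censoring fraction."""
    return {f: float(np.var(simulate_rate_estimates(f), ddof=1)) for f in fracs}


if __name__ == "__main__":
    for frac, var in empirical_variances().items():
        print(f"censoring fraction {frac:.2f}: variance of estimate {var:.5f}")
```

For this particular parametric model the asymptotic variance of the estimate is $\lambda^2 / \{ n \, \mathbb{P}(T \leq C) \}$, where $\lambda$ is the true rate, so the empirical variances should increase with the censoring fraction, in line with the general result above. Small differences (such as 0% versus 5% censoring) may need many replications to show up clearly.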
