# Survival Analysis – How to Build a Probability Censoring Function Without Affecting Kaplan-Meier Survival Function

kaplan-meier, survival

If I have the complete (uncensored) data for each subject, how can I design a probability function for censoring the data such that the value of the survival function does not change?

What condition must such a censoring function satisfy at each point in time?

Edit:
The KM survival function is defined as:
$$S(\tau_k) = \prod_{i=0}^{k}\left(1-\frac{d_i}{n_i}\right)$$
where $$d_i$$ is the number of people with an event in time interval $$i$$
and $$n_i$$ is the number of people at risk in that interval.

So, considering this, what condition must the censoring satisfy so that, at each time interval, the KM survival function of the complete data has the same value as the KM survival function of the censored version of the same data?

You can only do what you seek if, as in the answer from @clementzach, you have multiple events at each event time and you can choose the censored cases very specifically. You have written the Kaplan-Meier estimator as:

$$\hat S(\tau_k) = \prod_{i=0}^k\left[ 1- \frac{d_i}{n_i}\right]$$

where $$d_i$$ is the number of cases having the event at time $$\tau_i$$ and $$n_i$$ is the corresponding number at risk at that time. At each event time $$\tau_i$$, the ratio $$d_i/n_i$$ represents the fractional change of the survival curve from its previous value toward 0 survival. So as you censor you must do so in a way that $$d_i/n_i$$ is unchanged for all event times $$\tau_i$$.
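To make the roles of $$d_i$$ and $$n_i$$ concrete, here is a minimal base-R sketch that computes the estimator directly from those counts. The function name `km_by_hand` and the column convention (`event = 1` for an observed event, `0` for right-censored) are assumptions made for illustration; in practice `survfit()` from the survival package does this work.

```r
# Kaplan-Meier estimate computed directly from d_i and n_i (base R only).
# Assumed convention: event = 1 is an observed event, 0 is right-censored.
km_by_hand <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))
  # d_i: number of events at each event time
  d <- vapply(event_times, function(t) sum(time == t & event == 1), numeric(1))
  # n_i: number still under observation (time >= event time)
  n <- vapply(event_times, function(t) sum(time >= t), numeric(1))
  data.frame(time = event_times, n_risk = n, n_event = d,
             surv = cumprod(1 - d / n))
}

# Five uncensored cases, two of them sharing event time 2
km_by_hand(time = c(1, 2, 2, 3, 4), event = c(1, 1, 1, 1, 1))
```

Each row of the result corresponds to one factor $$1 - d_i/n_i$$ in the product, so any censoring scheme that leaves every `n_event / n_risk` ratio unchanged leaves `surv` unchanged too.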

If you have any event time $$\tau_j$$ with only one individual having the event at that time (as I implicitly assumed in my comments) and right-censor that individual's event time, $$d_j=0$$ after censoring. Thus $$\hat S(\tau_j)$$ no longer has a drop at time $$\tau_j$$, changing the Kaplan-Meier curve. The number at risk will be the same after $$\tau_j$$, but the baseline from which the next event leads to a drop will be higher than previously. Thus the magnitude of a drop at a subsequent event time might also change.

Similarly, consider a one-individual event time at $$\tau_j$$ with some later event times right-censored before $$\tau_j$$. Then the Kaplan-Meier curve after censoring is also changed from the uncensored version. In this case, although $$d_j$$ is still 1, $$n_j$$ is now lower than without censoring so the drop in the curve at $$\tau_j$$ is greater.
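Both failure modes can be seen in a small worked example. The data set is hypothetical and chosen only for illustration: three individuals with one event each at times 1, 2 and 3. Without censoring:

$$\hat S(1) = 1-\frac{1}{3} = \frac{2}{3}, \qquad \hat S(2) = \frac{2}{3}\left(1-\frac{1}{2}\right) = \frac{1}{3}, \qquad \hat S(3) = \frac{1}{3}\left(1-\frac{1}{1}\right) = 0$$

If the individual with the event at time 2 is right-censored at that time, then $$d_2 = 0$$ and the drop at time 2 disappears:

$$\hat S(2) = \frac{2}{3}\left(1-\frac{0}{2}\right) = \frac{2}{3} \ne \frac{1}{3}$$

If instead the individual with the event at time 3 is right-censored at time 1.5, then $$d_2 = 1$$ as before but $$n_2$$ falls from 2 to 1, so the drop at time 2 is larger:

$$\hat S(2) = \frac{2}{3}\left(1-\frac{1}{1}\right) = 0 \ne \frac{1}{3}$$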

Thus a necessary condition to have a completely unchanged Kaplan-Meier curve is to only censor event times of individuals who share event times with others and not to censor event times subsequent to times at which only 1 individual has an event.

You also have to be very careful in how you choose the individuals whose event times you censor. You have to do that in a way that keeps the ratio of the number of events to the number at risk, $$d_i/n_i$$, unchanged at each event time $$\tau_i$$. As the answer from @clementzach (+1) shows, you might be able to do this if, between event times, you censor proportionately from those still at risk at each subsequent event time. (Random censoring leads to random changes in $$d_i/n_i$$ from the uncensored situation.) That is, you have to make sure that the change in the number of cases with events is exactly balanced by the change in the number at risk at each event time.
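The balancing requirement can be written out explicitly. The following is a sketch under simplifying assumptions: no earlier censoring has left censored cases in the risk set, $$m_k$$ (a symbol introduced here for illustration) is the number of cases still at risk whose event time is $$\tau_k$$, and a fraction $$c < 1$$ of each such group is censored just before $$\tau_i$$. Before censoring, $$d_i = m_i$$ and $$n_i = \sum_{k \ge i} m_k$$. After censoring, every $$m_k$$ with $$k \ge i$$ becomes $$(1-c)\,m_k$$, so

$$\frac{d_i'}{n_i'} = \frac{(1-c)\,m_i}{(1-c)\sum_{k \ge i} m_k} = \frac{m_i}{\sum_{k \ge i} m_k} = \frac{d_i}{n_i}$$

The same cancellation holds at every later event time, because each later group was reduced by the same factor. It is exact only when $$(1-c)\,m_k$$ is a whole number for every affected group.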

One way to do that is, between 2 event times, to count the cases still at risk that are destined for each subsequent event time and censor exactly the same proportion of each of those groups. For a completely unchanged Kaplan-Meier curve you also, however, need a large enough sample size that you don't end up with rounding differences in $$d_i/n_i$$ between the censored and uncensored situations.
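The procedure just described can be sketched as a small base-R helper. The function `censor_proportionally` and its arguments `at` and `frac` are hypothetical names introduced here, not part of any package. The sketch assumes columns `time` and `event` (1 = event, 0 = censored), assumes it is applied in increasing order of `at`, and stops rather than rounds when the requested fraction does not give a whole number of cases.

```r
# Hypothetical helper: at time `at`, censor the same fraction `frac` of the
# cases still at risk within every later event-time group.
censor_proportionally <- function(df, at, frac) {
  later <- sort(unique(df$time[df$event == 1 & df$time > at]))
  for (t in later) {
    idx <- which(df$event == 1 & df$time == t)
    k <- frac * length(idx)
    # Refuse to round: a non-integer count would change d_i / n_i
    if (abs(k - round(k)) > 1e-8) {
      stop("frac * group size is not a whole number at event time ", t)
    }
    chosen <- idx[seq_len(round(k))]
    df$event[chosen] <- 0
    df$time[chosen] <- at
  }
  df
}

full <- data.frame(time = rep(1:4, each = 100), event = 1)
cens <- censor_proportionally(full, at = 0.5, frac = 0.20)   # 20 of 100 per group
cens <- censor_proportionally(cens, at = 1.5, frac = 1 / 8)  # 10 of 80 per group

# d_i / n_i at the first two event times, uncensored vs censored
ratio <- function(df, t) sum(df$time == t & df$event == 1) / sum(df$time >= t)
c(ratio(full, 1), ratio(cens, 1))
c(ratio(full, 2), ratio(cens, 2))
```

Which cases within a group are censored does not matter for the curve; the helper simply takes the first ones in row order.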

Here's an example with a large number of cases and only 4 event times. Set up an uncensored data set with 400 cases, 100 having each event time, in order of their event times.

df400 <- data.frame(time = rep(1:4, each = 100), event = 1)


Make a copy to censor:

df400cens <- df400


Before the first event time, censor the same fraction of the cases that would have had events at each event time. Here, censor 20 of the 100 cases (20%) at each event time, with a censoring time of 0.5, before the first event time.

df400cens[c(1:20,101:120,201:220,301:320),"event"] <- 0
df400cens[c(1:20,101:120,201:220,301:320),"time"] <- 0.5


Between the first and the second event times, censor some fraction of those still at risk, again with the same fraction taken from the cases at each subsequent event time. Here, censor 10 of the 80 cases still at risk at each subsequent event time, with a censoring time of 1.5.

df400cens[c(121:130,221:230,321:330),"event"] <- 0
df400cens[c(121:130,221:230,321:330),"time"] <- 1.5


Similarly, censor between the second and third event times, here 30 of the 70 cases still at risk at each subsequent event time, with a censoring time of 2.5.

df400cens[c(231:260,331:360),"event"] <- 0
df400cens[c(231:260,331:360),"time"] <- 2.5


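Censor as many as you'd like of those still at risk before the last event time, so long as you leave at least 1. Everyone remaining has the event at the last event time, so $$d_i/n_i = 1$$ there however many are left. Here, 39 of the remaining 40 are censored at time 3.5.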

df400cens[361:399,"event"] <- 0
df400cens[361:399,"time"] <- 3.5


Compare the uncensored and censored survival curves. The dashed blue censored curve exactly matches the original survival function in black.

library(survival)
plot(survfit(Surv(time, event) ~ 1, data = df400), conf.int = FALSE, mark.time = TRUE, bty = "n", xlab = "Time", ylab = "Fraction surviving")
lines(survfit(Surv(time, event) ~ 1, data = df400cens), conf.int = FALSE, mark.time = TRUE, lty = 2, col = "blue", lwd = 3)
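The match can also be checked numerically rather than by eye. The sketch below rebuilds both data sets so that it runs on its own, and computes the survival values by hand in base R instead of calling `survfit()`. The helper names `censor_rows` and `km_at` are hypothetical, and `km_at` assumes at least one case is at risk at each requested time.

```r
# Rebuild the uncensored and censored data sets from the steps above
df400 <- data.frame(time = rep(1:4, each = 100), event = 1)

# Hypothetical helper: mark the given rows as censored at time `at`
censor_rows <- function(df, rows, at) {
  df[rows, "event"] <- 0
  df[rows, "time"] <- at
  df
}
df400cens <- censor_rows(df400, c(1:20, 101:120, 201:220, 301:320), 0.5)
df400cens <- censor_rows(df400cens, c(121:130, 221:230, 321:330), 1.5)
df400cens <- censor_rows(df400cens, c(231:260, 331:360), 2.5)
df400cens <- censor_rows(df400cens, 361:399, 3.5)

# Kaplan-Meier values at the given event times: cumulative product of 1 - d_i / n_i
km_at <- function(df, times) {
  cumprod(vapply(times, function(t) {
    1 - sum(df$time == t & df$event == 1) / sum(df$time >= t)
  }, numeric(1)))
}

km_at(df400, 1:4)
km_at(df400cens, 1:4)
all.equal(km_at(df400, 1:4), km_at(df400cens, 1:4))
```

With the survival package loaded, `summary(survfit(Surv(time, event) ~ 1, data = df400cens), times = 1:4)$surv` should give the same values.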


So although it is sometimes possible to have a censored Kaplan-Meier curve that matches the original uncensored version in terms of the survival function, it requires a very specific set of circumstances, careful attention to proportionality of censoring of those still at risk, and a large enough data set that you don't end up with differences from rounding.