👋 Hi, author of lifelines here. What you're asking is possible. The Kaplan-Meier curve gives you a distribution of possible durations, where duration is the time between birth and death. However, given that a player has already existed for $N$ months, you can condition the survival function on $T > N$ to get a better estimate.
Let $S(t) = P(T \ge t)$ be the survival function. We are curious about $S(t | T \ge N)$.
$$
S(t | T\ge N) = \frac{P(T \ge t \text{ and } T \ge N)}{P(T \ge N)} = \frac{P(T \ge t)}{S(N)} = \frac{S(t)}{S(N)},\;\; t \ge N
$$
So we simply need to divide the survival function by its own value at the duration observed so far.
In your use case, you could do something like:
import pandas as pd

sf = kmf.survival_function_['KM_estimate']  # fitted KM survival function
predictions = pd.DataFrame(index=sf.index)
for ix, row in alive_individuals.iterrows():
    T = row['T']  # duration observed so far for this individual
    predictions[ix] = sf / sf.loc[T]

# can't have probabilities greater than 1 (this also handles t < T)
predictions[predictions > 1.0] = 1.0
This gives you the new survival function for each individual. However, lifelines has another utility you can use: kmf.conditional_time_to_event_ computes these conditional survival functions and then takes the median time remaining. Output using some fake data I have:
KM_estimate - Conditional time remaining to event
timeline
0.0 56.0
6.0 50.0
7.0 49.0
9.0 47.0
13.0 43.0
15.0 41.0
17.0 39.0
19.0 37.0
22.0 34.0
26.0 32.0
29.0 29.0
32.0 26.0
33.0 27.0
36.0 24.0
38.0 22.0
41.0 19.0
43.0 17.0
45.0 15.0
47.0 13.0
48.0 13.0
51.0 10.0
53.0 8.0
54.0 7.0
56.0 7.0
58.0 5.0
60.0 8.0
61.0 7.0
62.0 6.0
63.0 6.0
66.0 3.0
68.0 1.0
69.0 6.0
75.0 0.0
So if a player has lived to 62 months, we expect a median of 6 more months (6 months being the median time to death, given the player lived to 62). That may help as well.
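For completeness, here is a minimal sketch of producing such a table, using the Waltons example data set that ships with lifelines (any arrays of durations and event indicators would do):

from lifelines import KaplanMeierFitter
from lifelines.datasets import load_waltons

df = load_waltons()  # example data with columns 'T' (duration) and 'E' (event observed)
kmf = KaplanMeierFitter().fit(df['T'], df['E'])

# median time remaining to event, conditional on surviving to each timeline point
print(kmf.conditional_time_to_event_)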
Weights in a survival model give you flexibility in data formatting, or a way to adjust estimates for sampling that wasn't representative. Therneau and Grambsch say, in Section 7.3 of Modeling Survival Data: Extending the Cox Model (Springer, 2000):
Two distinct uses for case weights (among many uses) need to be distinguished. The first is frequency weights; a weight of 3 means that 3 data points were actually observed, had the same values for all variables, and have been collapsed into a single observation to save space. The program should then treat an observation with a weight of k as if it had appeared k times in the input data set. The second is sampling weights. For instance, if 10% of the high-risk subjects for a condition were included in a study but only 1% of those with low or moderate risk, we would want to weight the observations inversely as the sampling fractions to reflect this design, giving case weights 10 times greater to the low/moderate-risk individuals than to the high-risk ones.
For a Kaplan-Meier estimate, you could just weight both the deaths and the numbers at risk at each event time $i$ by the individual case weights. Think, for example, about the frequency weights in the quote above: for a case weight of 2, you double-count the weighted case in the denominator so long as it is at risk, and give it a count of 2 in the numerator at its event time. I'm not sure that's how it's implemented in the survival package; you could check by examining the C source code for Csurvfitkm, which does the main calculations. Perhaps Thomas Lumley, who used to maintain the package, could discuss further.
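To make that bookkeeping concrete, here is a minimal hand-rolled sketch of a case-weighted Kaplan-Meier estimate in Python, on made-up toy data (an illustration of the idea above, not the survival package's actual implementation):

import numpy as np

times   = np.array([2., 3., 3., 5., 7.])   # observed times
events  = np.array([1,  1,  0,  1,  1])    # 1 = death, 0 = censored
weights = np.array([1., 2., 1., 1., 3.])   # a weight of 2 counts a case twice

surv = 1.0
for t in np.unique(times[events == 1]):
    at_risk = weights[times >= t].sum()                     # weighted number at risk
    deaths  = weights[(times == t) & (events == 1)].sum()   # weighted deaths at t
    surv *= 1 - deaths / at_risk
    print(t, surv)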
For the Cox partial likelihood solution, it's essentially what you propose. For the unweighted situation, differentiating the log partial likelihood with respect to the parameter-value vector $\theta$ gives a score vector (Equation 3.4 of Therneau and Grambsch):
$$ U(\theta) = \sum_{i=1}^n \int_0^{\infty} \left[X_i(s) - \bar x(\theta,s)\right] dN_i(s) = \sum_{i=1}^n U_i(\theta)$$
where $X_i$ represents the covariate values for case $i$ and $\bar x$ is a risk-weighted mean of $X$ over observations at risk.* The maximum partial likelihood estimator $\hat \theta$ solves:
$$\sum_{i=1}^n U_i(\hat \theta) = 0. $$
With case weights $w_i$, you instead solve:
$$\sum_{i=1}^n w_i U_i(\hat \theta) = 0, $$
while also case-weighting the contributions to $\bar x(\theta,s)$* in the score vector. Handling of variances and the information matrix is similar. See Section 7.3 of Therneau and Grambsch.
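As a quick check of what frequency weights do, lifelines' CoxPHFitter (staying in Python; R's coxph should treat integer case weights the same way for coefficient estimates) accepts a weights_col argument, and a weight of $k$ reproduces the coefficients you would get by repeating that row $k$ times. The data frame below is made-up toy data:

import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({'T': [5., 6., 8., 9., 11., 12.],
                   'E': [1, 1, 0, 1, 1, 0],
                   'x': [0., 1., 1., 0., 1., 0.],
                   'w': [1, 2, 1, 1, 1, 2]})   # 'w' holds frequency weights

weighted = CoxPHFitter().fit(df, 'T', 'E', weights_col='w')

# expand rows according to their weights and refit without weights
expanded = df.loc[df.index.repeat(df['w'])].drop(columns='w')
unweighted = CoxPHFitter().fit(expanded, 'T', 'E')

print(weighted.params_['x'], unweighted.params_['x'])  # identical coefficients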
Note that the coxph default Efron approximation for tied event times is implemented via temporary case weights even in an unweighted Cox regression; see Section 5.1 of the main R survival vignette.
Case weights do affect some other calculations in coxph(). For example, non-integer case weights (as you might have in inverse propensity score weighting) lead to calculation of a robust variance estimate; see Section 2.7 of that vignette.
*The risk score for case $i$ in a regression without case weights is $r_i(\theta,s) =\exp[\theta' X_i(s)]$. Then the risk-weighted covariate average is:
$$\bar x(\theta,s) = \frac{\sum_i Y_i(s)\, r_i(\theta,s) X_i(s)}{\sum_i Y_i(s)\, r_i(\theta,s)},$$
where $Y_i(s)$ is the at-risk indicator for time $s$. In the case-weighted regression, $r_i$ becomes $w_i r_i$.
You can only do what you seek if, as in the answer from @clementzach, you have multiple events at each event time and you can choose the censored cases very specifically. You have written the Kaplan-Meier estimator as:
$$\hat S(\tau_k) = \prod_{i=0}^k\left[ 1- \frac{d_i}{n_i}\right] $$
where $d_i$ is the number of cases having the event at time $\tau_i$ and $n_i$ is the corresponding number at risk at that time. At each event time $\tau_i$, the ratio $d_i/n_i$ represents the fractional drop of the survival curve from its previous value toward 0. So as you censor, you must do so in a way that leaves $d_i/n_i$ unchanged for all event times $\tau_i$.
If you have any event time $\tau_j$ with only one individual having the event at that time (as I implicitly assumed in my comments) and you right-censor that individual's event time, then $d_j = 0$ after censoring. Thus $\hat S(\tau_j)$ no longer has a drop at time $\tau_j$, changing the Kaplan-Meier curve. The number at risk will be the same after $\tau_j$, but the baseline from which the next event leads to a drop will be higher than previously, so the magnitude of a drop at a subsequent event time might also change.
Similarly, consider a one-individual event time at $\tau_j$ with some later event times right-censored before $\tau_j$. Then the Kaplan-Meier curve after censoring is also changed from the uncensored version. In this case, although $d_j$ is still 1, $n_j$ is now lower than without censoring so the drop in the curve at $\tau_j$ is greater.
Thus a necessary condition to have a completely unchanged Kaplan-Meier curve is to only censor event times of individuals who share event times with others and not to censor event times subsequent to times at which only 1 individual has an event.
You also have to be very careful in how you choose the individuals whose event times you censor. You have to do that in a way that keeps the ratio of events to numbers at risk, $d_i /n_i$, unchanged at each event time $i$. As the answer from @clementzach (+1) shows, you might be able to do this if you censor between event times proportionately from those still at risk at each subsequent event time. (Random censoring leads to random changes in $d_i/n_i$ from the uncensored situation.) That is, you have to make sure that the change in the number of cases with events is exactly balanced by the change in the number at risk at each event time.
One way to do that is, between 2 event times, to evaluate the number still at risk at each subsequent event time and censor the exact same proportion from those having events at each subsequent event time. For a completely unchanged Kaplan-Meier curve you also, however, need a large enough sample size that you don't end up with rounding differences in $d_i/n_i$ between the censored and uncensored situations.
Here's an example with a large number of cases and only 4 event times. Set up an uncensored data set with 400 cases, 100 having each event time, in order of their event times.
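The original construction presumably used R's survival package; the sketch below is a Python parallel (numpy for the data, lifelines for the curves), with the specific censoring times 0.5, 1.5, 2.5, and 3.5 chosen arbitrarily between the event times:

import numpy as np

# 400 uncensored cases: event times 1, 2, 3, 4 with 100 cases each
times_u = np.repeat([1., 2., 3., 4.], 100)
events_u = np.ones_like(times_u)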
Make a copy to censor.
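In the sketch:

times_c = times_u.copy()
events_c = events_u.copy()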
Before the first event time, censor the same fraction of those at risk from those having events at each event time. Here, censor 20% from each event time prior to the first event time.
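In the sketch, censoring 20 cases from each event-time group at time 0.5 keeps $d_1/n_1 = 80/320 = 100/400$:

for t in [1., 2., 3., 4.]:
    idx = np.where((times_c == t) & (events_c == 1))[0][:20]
    times_c[idx] = 0.5   # censored before the first event time
    events_c[idx] = 0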
Between the first and the second event times, censor some fraction of those still at risk, again with the same fraction censored from those still at risk at subsequent times. Here, censoring 10 out of 80 still at risk for each subsequent event time.
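At time 1.5 in the sketch, so that $d_2/n_2 = 70/210 = 100/300$:

for t in [2., 3., 4.]:
    idx = np.where((times_c == t) & (events_c == 1))[0][:10]
    times_c[idx] = 1.5   # censored between the first and second event times
    events_c[idx] = 0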
Similarly censor between the second and third event times, here censoring 30 of 70 still at risk at each subsequent event time.
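At time 2.5 in the sketch, giving $d_3/n_3 = 40/80 = 100/200$:

for t in [3., 4.]:
    idx = np.where((times_c == t) & (events_c == 1))[0][:30]
    times_c[idx] = 2.5   # censored between the second and third event times
    events_c[idx] = 0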
Censor as many as you'd like of those still at risk before the last event time, so long as you leave at least 1.
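Here, censoring 39 of the 40 remaining cases at time 3.5; the single remaining case still has its event at time 4, so the curve still drops to 0 there:

idx = np.where((times_c == 4.) & (events_c == 1))[0][:39]
times_c[idx] = 3.5   # censored before the last event time
events_c[idx] = 0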
Compare the uncensored and censored survival curves (a plotting sketch follows). The dashed blue censored curve exactly matches the original survival function in black.
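A fitting-and-plotting sketch of the comparison; both fitted survival functions step through 0.75, 0.50, 0.25, and 0.00:

from lifelines import KaplanMeierFitter

kmf_u = KaplanMeierFitter().fit(times_u, events_u, label='uncensored')
kmf_c = KaplanMeierFitter().fit(times_c, events_c, label='censored')

ax = kmf_u.plot_survival_function(ci_show=False, color='black')
kmf_c.plot_survival_function(ax=ax, ci_show=False, color='blue', linestyle='--')

print(kmf_u.survival_function_)
print(kmf_c.survival_function_)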
So although it is sometimes possible to have a censored Kaplan-Meier curve that matches the original uncensored version in terms of the survival function, it requires a very specific set of circumstances, careful attention to proportionality of censoring of those still at risk, and a large enough data set that you don't end up with differences from rounding.