I was asked to re-post this answer here from my comment at http://doingbayesiandataanalysis.blogspot.com/2012/01/complete-example-of-right-censoring-in.html
The specifics of this answer relate to the model in that comment, but the concepts apply to the topic here.
The core of the JAGS model for censored data is this:
isCensored[i] ~ dinterval( y[i] , censorLimitVec[i] )
y[i] ~ dnorm( mu , tau )
The key to understanding what JAGS is doing is that JAGS automatically imputes a random value for any variable that is not specified as a constant in the data. Thus, when y[i]
is NA
(i.e., a missing value, not a constant), then JAGS imputes a random value for it.
But what value should it generate?
The second line of the model, above, says that y[i]
should be randomly generated from a normal distribution with mean mu and precision tau.
But the first line of the model, above, puts another constraint on the randomly generated value of y[i]
. That line says that whatever value of y[i]
is randomly generated, it must fall on the side of censorLimitVec[i]
dictated by the value of isCensored[i]
.
To understand this part, let's unpack the dinterval()
distribution. Suppose that censorLimitVec
has 3 values in it, not just 1:
censorLimitVec = c(10,20,30)
Then randomly generated values from dinterval(y,c(10,20,30))
will be either 0, 1, 2, or 3 depending on whether $y<10$, $10 < y < 20$, $20<y<30$, or $30<y$. So, if $y=15$, dinterval(y,c(10,20,30))
has output of $1$ with 100% probability. The trick is this: We instead specify the output of dinterval
, and impute a random value of y
that could produce it. Thus, if we say
1 ~ dinterval(y,c(10,20,30))
then y
is imputed as a random value between 10 and 20.
Putting the two model statements together,
1 ~ dinterval( y , censorLimit )
y ~ dnorm( mu , tau )
means that y
comes from a normal density and y
must fall above the censorLimit
.
Hope that helps!!
👋Hi author of lifelines here. So what you asking is possible. The Kaplan-Meier curve gives you is a distribution of possible durations, where duration is the time between birth and death. However, given a player has existed for $N$ months, you can condition the survival function on $T > N$ to get a better estimate.
Let $S(t) = P(T \ge t)$ be the survival function. We are curious about $S(t | T \ge N)$.
$$
S(t | T\ge N) = \frac{P(T \ge t \text{ and } T \ge N)}{P(T \ge N)} = \frac{P(T \ge t)}{S(N)} = \frac{S(t)}{S(N)},\;\; t \ge N
$$
So we simply need to divide the survival function by itself evaluated at the duration seen thus far.
In your use case, you could do something like:
predictions = pd.DataFrame(index=kmf.survival_function_.index)
for ix, row in alive_individuals.iterrows():
T = row['T']
predictions[ix] = kmf.survival_function_/kmf.survival_function_.loc[T]
# can't have probabilities great than 1.
predictions[predictions > 1.0] = 1.
This gives you the new survival function. However, in lifelines, there is another utility you can use. kmf.conditional_time_to_event_
computes these conditional survival functions and then takes the median time remaining. Output using some fake data I have:
KM_estimate - Conditional time remaining to event
timeline
0.0 56.0
6.0 50.0
7.0 49.0
9.0 47.0
13.0 43.0
15.0 41.0
17.0 39.0
19.0 37.0
22.0 34.0
26.0 32.0
29.0 29.0
32.0 26.0
33.0 27.0
36.0 24.0
38.0 22.0
41.0 19.0
43.0 17.0
45.0 15.0
47.0 13.0
48.0 13.0
51.0 10.0
53.0 8.0
54.0 7.0
56.0 7.0
58.0 5.0
60.0 8.0
61.0 7.0
62.0 6.0
63.0 6.0
66.0 3.0
68.0 1.0
69.0 6.0
75.0 0.0
So if a player lives until age 62, we expect 6 more months left (6 months being the median time to death, given the player lived to 62). That may help as well.
Best Answer
That's not correct. A survival model has to incorporate the contributions of censored observations to the likelihood. Otherwise there is substantial risk of bias.
Location-scale survival modeling in this example is based on the form:
$$\log T = \eta + \sigma W ,$$
where $T$ is survival time, $\eta$ is a location parameter (the linear predictor as a function of covariate values), and $\sigma$ is a scale parameter (represented as
s
in your code, I think).$W$ has a probability distribution corresponding to the underlying parametric model. For the Weibull model in your example, $W$ is standard minimum-extreme-value (one type of Gumbel).
The 2 lines of code that you rightly note to be the heart of the example describe the contributions to the likelihood separately for uncensored (first line) and right-censored (second line) survival times. This page summarizes contributions to likelihood for various types of censoring and truncation in survival analysis.
An uncensored observation contributes a factor proportional to the density of the lifetime distribution at its observation time. That's what your first line of code describes.
Provided that censoring isn't informative, a right-censored observation contributes a factor proportional to the survival function of the lifetime distribution at the (censored) observation time. (A right-censored observation provides information that an individual survived at least that long, which is the survival function of the distribution.) That's what your second line of code represents. In this case there is no modeling of censoring times, just an incorporation of likelihood contributions from censored observations into the model.
The
Potential
object provides a way to incorporate such likelihoods into the model without having to define an entireDensityDist
. This page provides further information.In this way both uncensored and right-censored observation times provide their contributions to the likelihood. As there is no modeling of the censoring times per se, your sampling from the posterior is simply based on the probability distribution of uncensored lifetimes. There is no need to sample censored observation times, which in this case are only used to construct the likelihood.