Bayesian – Using pm.Potential in Pymc3 for Survival Analysis Examples

bayesianpymc3survival

I am working through the pymc3 Bayesian Survival Analysis example at the link below and I'm struggling to understand their use of pm.Potential:

https://docs.pymc.io/en/stable/pymc-examples/examples/survival_analysis/bayes_param_survival_pymc3.html

The heart of the example comes down to the following two snippets of code:

y_obs = pm.Gumbel("y_obs", η[~cens_], s, observed=y_std[~cens])
y_cens = pm.Potential("y_cens", gumbel_sf(y_std[cens], η[cens_], s))

Basically what is being done is they are modeling observed survival as a Gumbel distribution and censoring as a Gumbel survival function. In general that sort of makes sense to me. I do think I understand how observed deaths and observed censoring mathematically relate to these distributions. I just don't understand what pm.Potential is doing. In fact, if I take it out it doesn't appear to affect anything. Is it there to just calculate y_cens? If so, how do I access that value since I don't see it in the trace output. At this moment, I'm thinking the censored values are contributing nothing to this model and the fitted survival data is entirely based on a gumbel distribution fit of observed deaths. However, I also suspect, if I could figure out how to plot y_cens, it would also be able to give me a very similar estimate of survival trends.

Can anyone educate me on the function of pm.Potential in the above example?

Best Answer

Basically what is being done is they are modeling observed survival as a Gumbel distribution and censoring as a Gumbel survival function... I'm thinking the censored values are contributing nothing to this model and the fitted survival data is entirely based on a gumbel distribution fit of observed deaths.

That's not correct. A survival model has to incorporate the contributions of censored observations to the likelihood. Otherwise there is substantial risk of bias.

Location-scale survival modeling in this example is based on the form:

$$\log T = \eta + \sigma W ,$$

where $T$ is survival time, $\eta$ is a location parameter (the linear predictor as a function of covariate values), and $\sigma$ is a scale parameter (represented as s in your code, I think).

$W$ has a probability distribution corresponding to the underlying parametric model. For the Weibull model in your example, $W$ is standard minimum-extreme-value (one type of Gumbel).

The 2 lines of code that you rightly note to be the heart of the example describe the contributions to the likelihood separately for uncensored (first line) and right-censored (second line) survival times. This page summarizes contributions to likelihood for various types of censoring and truncation in survival analysis.

An uncensored observation contributes a factor proportional to the density of the lifetime distribution at its observation time. That's what your first line of code describes.

Provided that censoring isn't informative, a right-censored observation contributes a factor proportional to the survival function of the lifetime distribution at the (censored) observation time. (A right-censored observation provides information that an individual survived at least that long, which is the survival function of the distribution.) That's what your second line of code represents. In this case there is no modeling of censoring times, just an incorporation of likelihood contributions from censored observations into the model.

The Potential object provides a way to incorporate such likelihoods into the model without having to define an entire DensityDist. This page provides further information.

In this way both uncensored and right-censored observation times provide their contributions to the likelihood. As there is no modeling of the censoring times per se, your sampling from the posterior is simply based on the probability distribution of uncensored lifetimes. There is no need to sample censored observation times, which in this case are only used to construct the likelihood.

Related Solutions

Solved – Right-censored survival fit with JAGS

I was asked to re-post this answer here from my comment at http://doingbayesiandataanalysis.blogspot.com/2012/01/complete-example-of-right-censoring-in.html The specifics of this answer relate to the model in that comment, but the concepts apply to the topic here.

The core of the JAGS model for censored data is this:

isCensored[i] ~ dinterval( y[i] , censorLimitVec[i] )
y[i] ~ dnorm( mu , tau )

The key to understanding what JAGS is doing is that JAGS automatically imputes a random value for any variable that is not specified as a constant in the data. Thus, when y[i] is NA (i.e., a missing value, not a constant), then JAGS imputes a random value for it.

But what value should it generate?

The second line of the model, above, says that y[i] should be randomly generated from a normal distribution with mean mu and precision tau.

But the first line of the model, above, puts another constraint on the randomly generated value of y[i]. That line says that whatever value of y[i] is randomly generated, it must fall on the side of censorLimitVec[i] dictated by the value of isCensored[i].

To understand this part, let's unpack the dinterval() distribution. Suppose that censorLimitVec has 3 values in it, not just 1:

censorLimitVec = c(10,20,30)

Then randomly generated values from dinterval(y,c(10,20,30)) will be either 0, 1, 2, or 3 depending on whether $y<10$, $10 < y < 20$, $20<y<30$, or $30<y$. So, if $y=15$, dinterval(y,c(10,20,30)) has output of $1$ with 100% probability. The trick is this: We instead specify the output of dinterval, and impute a random value of y that could produce it. Thus, if we say

1 ~ dinterval(y,c(10,20,30))

then y is imputed as a random value between 10 and 20.

Putting the two model statements together,

1 ~ dinterval( y , censorLimit )

y ~ dnorm( mu , tau )

means that y comes from a normal density and y must fall above the censorLimit.

Hope that helps!!

Solved – How to get the survival duration prediction for each individual in the data by using the Kaplan-Meier method

👋Hi author of lifelines here. So what you asking is possible. The Kaplan-Meier curve gives you is a distribution of possible durations, where duration is the time between birth and death. However, given a player has existed for $N$ months, you can condition the survival function on $T > N$ to get a better estimate.

Let $S(t) = P(T \ge t)$ be the survival function. We are curious about $S(t | T \ge N)$.

$$ S(t | T\ge N) = \frac{P(T \ge t \text{ and } T \ge N)}{P(T \ge N)} = \frac{P(T \ge t)}{S(N)} = \frac{S(t)}{S(N)},\;\; t \ge N $$

So we simply need to divide the survival function by itself evaluated at the duration seen thus far.

In your use case, you could do something like:

predictions = pd.DataFrame(index=kmf.survival_function_.index)

for ix, row in alive_individuals.iterrows():
    T = row['T']
    predictions[ix] = kmf.survival_function_/kmf.survival_function_.loc[T]

# can't have probabilities great than 1. 
predictions[predictions > 1.0] = 1.

This gives you the new survival function. However, in lifelines, there is another utility you can use. kmf.conditional_time_to_event_ computes these conditional survival functions and then takes the median time remaining. Output using some fake data I have:

          KM_estimate - Conditional time remaining to event
timeline
0.0                                                    56.0
6.0                                                    50.0
7.0                                                    49.0
9.0                                                    47.0
13.0                                                   43.0
15.0                                                   41.0
17.0                                                   39.0
19.0                                                   37.0
22.0                                                   34.0
26.0                                                   32.0
29.0                                                   29.0
32.0                                                   26.0
33.0                                                   27.0
36.0                                                   24.0
38.0                                                   22.0
41.0                                                   19.0
43.0                                                   17.0
45.0                                                   15.0
47.0                                                   13.0
48.0                                                   13.0
51.0                                                   10.0
53.0                                                    8.0
54.0                                                    7.0
56.0                                                    7.0
58.0                                                    5.0
60.0                                                    8.0
61.0                                                    7.0
62.0                                                    6.0
63.0                                                    6.0
66.0                                                    3.0
68.0                                                    1.0
69.0                                                    6.0
75.0                                                    0.0

So if a player lives until age 62, we expect 6 more months left (6 months being the median time to death, given the player lived to 62). That may help as well.

Best Answer

Related Solutions

Solved – Right-censored survival fit with JAGS

Solved – How to get the survival duration prediction for each individual in the data by using the Kaplan-Meier method

Related Question