Solved – Basic questions about discrete time survival analysis

Tags: discrete-data, hazard, kaplan-meier, survival

I am attempting to carry out a discrete time survival analysis using a logistic regression model, and I'm not sure I completely understand the process. I would greatly appreciate assistance with a few basic questions.

Here is the set up:

I'm looking at membership in a group within a five-year time window. Each member has a monthly record of membership for each month that member is in the group. I'm considering only the members whose membership began during the five-year window (to avoid "left censoring" issues with members who joined earlier). Each record is indexed by time, with time one being the month the member joined. So, a member who stays for two and a half years will have thirty monthly records, numbered from one to thirty. Each record also carries a binary variable, which has a value of one for the last month of membership and zero otherwise; a value of one marks the event that the member has left the group. For each member whose membership continues beyond the five-year analysis window, all the binary variable values are zero (these are the right-censored individuals in the survival analysis).
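To make the record layout described above concrete, here is a minimal sketch of the expansion into one row per member-month. The input tuples, the field names (`member_id`, `month`, `event`) and the helper `to_person_period` are hypothetical, chosen only for illustration.

```python
# Hypothetical member-level input: (member_id, months_observed, left_group).
# left_group is False for members still active when the window closes
# (the right-censored individuals).
members = [
    ("A", 3, True),   # left the group in their 3rd month
    ("B", 2, False),  # still a member at the end of the window
]

def to_person_period(members):
    """Expand one row per member into one row per member-month."""
    rows = []
    for member_id, months_observed, left_group in members:
        for month in range(1, months_observed + 1):
            # The event flag is 1 only in the final month of a member who left.
            event = int(left_group and month == months_observed)
            rows.append({"member_id": member_id, "month": month, "event": event})
    return rows

person_period = to_person_period(members)
for row in person_period:
    print(row)
```

Member A contributes three rows with the event flag set only in month three, while the censored member B contributes two rows with the flag at zero throughout.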

So, the logistic regression model is built to predict the values of the binary event variable. So far, so good. One of the typical ways to evaluate a binary predictive model is to measure the lift on a holdout sample. For the logistic regression model I have built to predict the membership ending event, I have computed the lift on a holdout data set with a five to one ratio of non-events to events. I ranked the predicted values into deciles. The decile with the highest predicted values contains seventy percent ones, a lift of more than four. The first two deciles combined contain sixty-five percent of all the ones in the holdout. In certain contexts this would be considered a fairly decent predictive model, but I wonder whether it's good enough to carry out a survival analysis.
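As a check on the arithmetic: with a five to one ratio of non-events to events, the base event rate is $1/6 \approx 16.7\%$, so a top decile with seventy percent ones has a lift of $0.70 / (1/6) = 4.2$. The decile calculation itself can be sketched as follows; the function `decile_lift` and the toy data are made up for illustration and are not the actual holdout set.

```python
def decile_lift(y_true, y_score):
    """Per decile (highest scores first): event rate, lift, cumulative capture."""
    # Rank record indices from highest to lowest predicted value.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n = len(order)  # assumes at least ten records
    total_events = sum(y_true)
    base_rate = total_events / n
    results, captured = [], 0
    for d in range(10):
        # Integer arithmetic splits the ranking into ten near-equal groups.
        idx = order[d * n // 10:(d + 1) * n // 10]
        events = sum(y_true[i] for i in idx)
        captured += events
        rate = events / len(idx)
        results.append((rate, rate / base_rate, captured / total_events))
    return results

# Toy holdout: 60 records, 10 events (five to one), scores already descending.
y_true = [1, 1, 1, 1, 0, 0] + [1, 1, 1, 0, 0, 0] + [1, 1, 1] + [0] * 45
y_score = list(range(60, 0, -1))
results = decile_lift(y_true, y_score)
for decile, (rate, lift, capture) in enumerate(results[:3], start=1):
    print(decile, round(rate, 3), round(lift, 2), round(capture, 2))
```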

Let $h[j,k]$ be the hazard function for individual $j$ in month $k$, and let $S[j,k]$ be the probability that individual $j$ survives through month $k$.

Here are my fundamental questions:

  1. Is the discrete hazard function, $h[j,k]$, the conditional probability of non-survival (leaving the group) in each month?

  2. Are the predicted values from the logistic regression model estimates of the hazard function? (i.e., is $h[j,k]$ equal to the model predicted value for individual $j$ in month $k$, or does something more need to be done to obtain hazard function estimates?)

  3. Is the probability of survival up to month $q$ for individual $j$ equal to the product of one minus the hazard function from month one up to $q$, that is, does
    $S[j,q] = (1 - h[j,1]) \cdot (1 - h[j,2]) \cdot \ldots \cdot (1 - h[j,q])$?

  4. Is the mean value of $S[j,k]$ over all individuals $j$ for each time $k$ a reasonable estimate of the overall population mean survival probability?

  5. Should a plot of the overall population mean survival probability by month resemble the monthly Kaplan-Meier graph?

If the answer to any of these questions is no, then I have a serious misunderstanding, and could really use some assistance / explanation. Also, is there any rule of thumb for how good the binary predictive model needs to be in order to produce an accurate survival profile?

Best Answer

Assume $K$ is the largest value of $k$ (i.e. the largest month/period observed in your data).

  1. Here is the hazard function with a fully discrete parametrization of time, with a vector of parameters $\mathbf{B}$ and a vector of conditioning variables $\mathbf{X}$: $h_{j,k} = \frac{e^{\alpha_{k} + \mathbf{BX}}}{1 + e^{\alpha_{k} + \mathbf{BX}}}$. The hazard function may also be built around alternative parameterizations of time (e.g. include $k$ or functions of it as a variable in the model), or around a hybrid of both.

    The baseline logit hazard function describes the probability of event occurrence in time $k$, conditional upon having survived to the start of time $k$ (that is, on not having experienced the event in periods $1, \dots, k-1$). Adding predictors ($\mathbf{X}$) to the model further constrains this conditionality.

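  2. No, the logistic regression parameter estimates (e.g. $\hat{\alpha}_{1}$, $\dots$, $\hat{\alpha}_{K}$, $\mathbf{\hat{B}}$) are not the hazard functions themselves. The logistic regression models $\operatorname{logit}(h_{j,k}) = \alpha_{k} + \mathbf{BX}$, so you need to apply the anti-logit transform in (1) above to the linear predictor to get the hazard estimates. If the "predicted values" your software returns are already on the probability scale (rather than the logit scale), they are the estimated hazards $\hat{h}_{j,k}$ and nothing more needs to be done.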

  3. Yes, although I would notate it $\hat{S}_{j,q} = \prod_{i=1}^{q}{(1-\hat{h}_{j,i})}$. The survival function is the probability of not experiencing the event by time $q$, and of course may also be conditioned on $\mathbf{X}$.

  4. This is a subtle question, and I'm not sure I have answers. I do have questions, though. :) The sample size at each time period decreases over time due to right-censoring and due to event occurrence: would you account for this in your calculation of mean survival time? How? What do you mean by "the population"? What population are the individuals recruited to your study generalizing to? Or do you mean some statistical "super-population" concept? Inference is a big challenge in these models, because we estimate the coefficients ($\alpha_{k}$ and $\mathbf{B}$) and their standard errors, but need to do delta-method back-flips to get standard errors for $\hat{h}_{j,k}$, and (from my own work) deriving valid standard errors for $\hat{S}_{j,k}$ works only on paper (I can't get correct CI coverages for $\hat{S}_{j,k}$ in conditional models).

  5. You can use Kaplan-Meier-like step-function graphs, and you can also use plain line graphs (i.e. connect the dots between time periods with a line). Use the latter only when the concept of "discrete time" itself admits the possibility of subdivided periods. You can also plot or communicate estimates of cumulative incidence, $1 - S_{j,k}$. (Epidemiologists will often define "cumulative incidence" this way, although the term is used differently in competing risks models; the term "uptake" may also be used here.)
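The chain from coefficients to hazards to survival in points 1 to 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not a fitted model: the coefficient values are made up, covariates are assumed constant over time, and the helper names (`inv_logit`, `hazards`, `survival`, `life_table_hazard`) are hypothetical. The last helper illustrates a known special case relevant to point 5: with one $\alpha_{k}$ per month and no covariates, the fitted hazard in month $k$ is the observed proportion (events divided by records at risk), so the product of $1 - \hat{h}_{k}$ matches the discrete-time Kaplan-Meier estimate.

```python
import math

def inv_logit(x):
    """Anti-logit transform: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def hazards(alpha, beta, x):
    """h_k = inv_logit(alpha_k + B.X) for k = 1..K (time-constant X assumed)."""
    lin = sum(b * xi for b, xi in zip(beta, x))
    return [inv_logit(a + lin) for a in alpha]

def survival(h):
    """S_q = product over i = 1..q of (1 - h_i)."""
    s, out = 1.0, []
    for hk in h:
        s *= 1.0 - hk
        out.append(s)
    return out

def life_table_hazard(rows):
    """Events / records at risk per month, from (month, event) rows.

    Assumes months are numbered contiguously from 1.
    """
    at_risk, events = {}, {}
    for month, event in rows:
        at_risk[month] = at_risk.get(month, 0) + 1
        events[month] = events.get(month, 0) + event
    return [events[k] / at_risk[k] for k in sorted(at_risk)]

# Made-up coefficients, for illustration only.
alpha = [-2.0, -1.5, -1.0]
beta = [0.5]
h = hazards(alpha, beta, [1.0])
S = survival(h)
cumulative_incidence = [1.0 - s for s in S]
print(h, S, cumulative_incidence)

# Toy person-period rows (month, event): one member leaves in month 3,
# one is censored after month 2, one leaves in month 1.
rows = [(1, 0), (2, 0), (3, 1), (1, 0), (2, 0), (1, 1)]
h_lt = life_table_hazard(rows)
S_km = survival(h_lt)
print(h_lt, S_km)
```

Once covariates enter the model, each individual has their own $\hat{S}_{j,k}$ curve, and how best to average those curves is the open issue raised in point 4.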