Regression – Censoring Techniques for Discrete Survival Data

logistic, probability, regression, survival

Suppose I have data in R that looks like this. This data represents measurements of different patients over a period of time (discrete level):

df <- data.frame(patient_id = c(111,111,111, 111, 222, 222, 222), 
                 year = c(2010, 2011, 2012, 2013, 2011, 2012, 2013), 
                 gender = c("Male", "Male", "Male", "Male", "Female", "Female", "Female"), 
                 weight = c(98, 97, 102, 105, 87, 81, 83), 
                 state_at_year = c("healthy", "sick", "sicker", "sicker", "healthy", "sicker", "sicker"))

  patient_id year gender weight state_at_year
1        111 2010   Male     98       healthy
2        111 2011   Male     97          sick
3        111 2012   Male    102        sicker
4        111 2013   Male    105        sicker
5        222 2011 Female     87       healthy
6        222 2012 Female     81        sicker
7        222 2013 Female     83        sicker

I am interested in modelling the effect of different patient characteristics on how they transition between different states. To accomplish this, I am thinking of using Discrete Time Markov Cohort Models. Specifically, I am thinking of using the approach provided here (https://hesim-dev.github.io/hesim/articles/mlogit.html) in which:

  • All rows of data are isolated in which the patient starts at state k = 1
  • Then, a Multinomial Logistic Regression is fit on this dataset (note: since we are modelling the probability of a transition based on the information we know BEFORE the transition, the weight_end variable of any given row is never directly modelled)
  • This process is repeated for all other "k" states (excluding absorbing states, e.g. "death")
  • As a result, a series of Multinomial Logistic Regression Models is used to estimate the time-dependent transition probabilities between all states.
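
A minimal sketch of the steps above, assuming the data are already arranged with one row per transition. The data here are simulated purely for illustration (they are not the data in this question), the column names state_start, state_end, weight_start and gender_start are assumptions, and nnet::multinom is just one of several ways to fit a multinomial logit in R:

```r
library(nnet)  # recommended package that ships with R

set.seed(1)
n <- 300
states <- c("healthy", "sick", "sicker")

# Hypothetical simulated transitions, one row per observed transition
trans <- data.frame(
  state_start  = sample(states, n, replace = TRUE),
  state_end    = factor(sample(states, n, replace = TRUE), levels = states),
  weight_start = round(rnorm(n, mean = 90, sd = 10)),
  gender_start = factor(sample(c("Male", "Female"), n, replace = TRUE),
                        levels = c("Female", "Male"))
)

# One multinomial logit per starting state, using only start-of-interval covariates
fits <- lapply(split(trans, trans$state_start), function(d) {
  multinom(state_end ~ weight_start + gender_start, data = d, trace = FALSE)
})

# Estimated transition probabilities out of "healthy" for a hypothetical patient
new_patient <- data.frame(
  weight_start = 95,
  gender_start = factor("Male", levels = c("Female", "Male"))
)
p_healthy <- predict(fits[["healthy"]], newdata = new_patient, type = "probs")
p_healthy
```

Each element of fits corresponds to one row of the transition matrix; an absorbing state such as death would simply be left out of the split.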

To use Discrete Time Markov Cohort Models (https://hesim-dev.github.io/hesim/articles/mlogit.html), I would have to reshape the data so that each row represents a transition between states:

  patient_id year_start year_end gender_start gender_end state_start state_end weight_start weight_end
1        111       2010     2011         Male       Male     healthy      sick           98         97
2        111       2011     2012         Male       Male        sick    sicker           97        102
3        111       2012     2013         Male       Male      sicker    sicker          102        105
4        222       2011     2012       Female     Female     healthy    sicker           87         81
5        222       2012     2013       Female     Female      sicker    sicker           81         83

structure(list(patient_id = c(111, 111, 111, 222, 222), year_start = c(2010, 
2011, 2012, 2011, 2012), year_end = c(2011, 2012, 2013, 2012, 
2013), gender_start = c("Male", "Male", "Male", "Female", "Female"
), gender_end = c("Male", "Male", "Male", "Female", "Female"), 
    state_start = c("healthy", "sick", "sicker", "healthy", "sicker"
    ), state_end = c("sick", "sicker", "sicker", "sicker", "sicker"
    ), weight_start = c(98, 97, 102, 87, 81), weight_end = c(97, 
    102, 105, 81, 83)), row.names = c(1L, 2L, 3L, 4L, 5L), class = "data.frame")
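
For reference, a sketch of one way to derive this transition format from the original long data in base R. The helper name to_transitions is hypothetical, and the sketch assumes each patient is observed in consecutive years with no gaps:

```r
df <- data.frame(patient_id = c(111, 111, 111, 111, 222, 222, 222),
                 year = c(2010, 2011, 2012, 2013, 2011, 2012, 2013),
                 gender = c("Male", "Male", "Male", "Male", "Female", "Female", "Female"),
                 weight = c(98, 97, 102, 105, 87, 81, 83),
                 state_at_year = c("healthy", "sick", "sicker", "sicker",
                                   "healthy", "sicker", "sicker"))

df <- df[order(df$patient_id, df$year), ]

# Pair every row of a patient with that patient's next row
to_transitions <- function(d) {
  n <- nrow(d)
  if (n < 2) return(NULL)  # a single observation yields no transition
  start <- d[-n, ]
  end   <- d[-1, ]
  data.frame(patient_id   = start$patient_id,
             year_start   = start$year,          year_end   = end$year,
             gender_start = start$gender,        gender_end = end$gender,
             state_start  = start$state_at_year, state_end  = end$state_at_year,
             weight_start = start$weight,        weight_end = end$weight)
}

transitions <- do.call(rbind, lapply(split(df, df$patient_id), to_transitions))
rownames(transitions) <- NULL
transitions
```

A patient with T observations yields T - 1 transition rows, which is where the "lost" row per patient comes from.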

It appears as though there is no way but to eliminate the last row of data for each patient, as the transition ending at that row is the last one available for that patient. This means that we will be forced to lose one row of data for each patient.

In cases where the patient experiences an absorbing event (e.g. death), this is not a problem. However, in cases where the patient is "right censored" (i.e. has the event after the end of the study), there seems to be nothing we can do to account for censoring other than removing the last row of data for each patient. We could try to use some imputation method, or assume that the patient remains in the state they are currently in, but this is risky. As such, it seems like there is no option but to discard the last available row of data (i.e. the weight_end value occurring at the last row) for each patient and keep only the complete transitions for each patient.

Is my understanding of this correct?

Best Answer

If there are any events (state transitions) in the data set during the last shared time interval, then the "last rows" of individuals who don't have the event during that interval should not be "discarded." They still contribute to the information for building the model and are not "discarded" by the modeling process, either.

Consider a simple 2-state alive/dead transition model. The discrete-time model for that last time interval is a binomial regression of a transition to death versus no transition. You need to keep track of the number of no-transition cases to evaluate the probability of a transition during that last time interval. The argument extends to multinomial regression for multi-state models.
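
A small sketch of that alive/dead case, using made-up person-period counts (one row per patient per interval at risk; the counts and column names are assumptions for illustration only). The patients who are still alive at the end of the last interval supply the denominator for that interval's hazard:

```r
# Hypothetical person-period data
pp <- data.frame(
  interval = factor(rep(c(1, 2), times = c(12, 10))),
  died = c(rep(c(1, 0), times = c(2, 10)),  # interval 1: 2 deaths among 12 at risk
           rep(c(1, 0), times = c(2, 8)))   # interval 2 (last): 2 deaths, 8 alive at study end
)

# Discrete-time hazard model with a separate intercept per interval
fit <- glm(died ~ 0 + interval, family = binomial, data = pp)
hazard <- plogis(coef(fit))
hazard  # estimated probability of death within each interval

# Dropping the 8 no-transition rows would leave only deaths in the last interval,
# so the observed death proportion there would be 1 rather than 2/10
naive_last <- mean(pp$died[pp$interval == "2" & pp$died == 1])
```

With the no-transition rows kept, the last-interval estimate is the observed proportion 2/10; without them the model has no information about how many patients were at risk.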

If there are no events at all during that last time interval, then there is no estimated hazard of an event during that time interval. You might have a situation like that in the answer you cite in a comment. If each discrete-time interval is modeled with a separate intercept for a baseline hazard, then data from the last time interval will provide no information for event probabilities during that interval. Those last-interval data points might nevertheless contribute some information for some models that fit the baseline hazard over time to a smoothed form. With a parametric model, such cases with a right-censored transition time provide a likelihood contribution proportional to the survival curve up to the right-censoring time.
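
To illustrate the parametric case, here is a sketch that assumes a constant per-interval hazard h (a geometric model, chosen only for simplicity) and made-up follow-up data. A patient with an event in interval t contributes h(1 - h)^(t - 1) to the likelihood, while a patient right-censored after t intervals contributes the survival probability (1 - h)^t:

```r
# Hypothetical follow-up: intervals observed, and whether an event ended follow-up
time  <- c(1, 2, 2, 3, 3, 3)
event <- c(1, 1, 0, 1, 0, 0)  # 0 = right-censored after `time` intervals

loglik <- function(h) {
  sum(ifelse(event == 1,
             log(h) + (time - 1) * log(1 - h),  # event occurred in interval `time`
             time * log(1 - h)))                # survived all `time` intervals
}

# Numerical maximum-likelihood estimate of the hazard
h_hat <- optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum

# Closed form for this model: events divided by total person-intervals at risk
h_closed <- sum(event) / sum(time)
c(h_hat = h_hat, h_closed = h_closed)
```

The censored patients add to the person-intervals in the denominator, so removing them would push the estimated hazard upward.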
