Suppose I have data in R that looks like this. This data represents measurements of different patients over a period of time (discrete level):
df <- data.frame(patient_id = c(111,111,111, 111, 222, 222, 222),
year = c(2010, 2011, 2012, 2013, 2011, 2012, 2013),
gender = c("Male", "Male", "Male", "Male", "Female", "Female", "Female"),
weight = c(98, 97, 102, 105, 87, 81, 83),
state_at_year = c("healthy", "sick", "sicker", "sicker", "healthy", "sicker", "sicker"))
  patient_id year gender weight state_at_year
1        111 2010   Male     98       healthy
2        111 2011   Male     97          sick
3        111 2012   Male    102        sicker
4        111 2013   Male    105        sicker
5        222 2011 Female     87       healthy
6        222 2012 Female     81        sicker
7        222 2013 Female     83        sicker
I am interested in modelling the effect of different patient characteristics on how they transition between different states. To accomplish this, I am thinking of using Discrete Time Markov Cohort Models. Specifically, I am thinking of using the approach provided here (https://hesim-dev.github.io/hesim/articles/mlogit.html) in which:
- All rows in which the patient starts in state k = 1 are isolated
- Then, a Multinomial Logistic Regression is fit on this subset (note: since we are modelling the probability of transition based on the information we know BEFORE the transition, the weight_end variable is never directly modelled for any given row)
- This process is repeated for all other "k" states (excluding absorbing states, e.g. "death")
- As a result, a series of Multinomial Logistic Regression Models is used to estimate the time-dependent transition probabilities between all states.
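A minimal sketch of the per-state fitting step in the list above, assuming `nnet::multinom` as the fitting function. The 7-row example is far too small to fit anything, so the transitions here are simulated at random purely for illustration; `sim`, `fits` and `p_healthy` are hypothetical names, and the covariates mirror the `_start` columns only.

```r
library(nnet)  # provides multinom(); ships with standard R installations

set.seed(1)
n <- 300

# Simulated transition rows (illustrative only, NOT real patient data)
sim <- data.frame(
  state_start  = sample(c("healthy", "sick"), n, replace = TRUE),
  gender_start = factor(sample(c("Female", "Male"), n, replace = TRUE),
                        levels = c("Female", "Male")),
  weight_start = round(rnorm(n, mean = 90, sd = 10))
)
sim$state_end <- factor(
  sample(c("healthy", "sick", "sicker"), n, replace = TRUE),
  levels = c("healthy", "sick", "sicker")
)

# One multinomial logit per starting state, using only pre-transition covariates
fits <- lapply(split(sim, sim$state_start), function(d) {
  multinom(state_end ~ gender_start + weight_start, data = d, trace = FALSE)
})

# Estimated transition probabilities out of "healthy" for one covariate profile
new_patient <- data.frame(
  gender_start = factor("Male", levels = c("Female", "Male")),
  weight_start = 98
)
p_healthy <- predict(fits$healthy, newdata = new_patient, type = "probs")
print(p_healthy)
```

Each element of `fits` gives one row of the transition matrix as a function of the covariates; an absorbing state would simply have no model.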
To use the data with Discrete Time Markov Cohort Models (https://hesim-dev.github.io/hesim/articles/mlogit.html), I would have to restructure it so that each row represents a transition between states:
  patient_id year_start year_end gender_start gender_end state_start state_end weight_start weight_end
1        111       2010     2011         Male       Male     healthy      sick           98         97
2        111       2011     2012         Male       Male        sick    sicker           97        102
3        111       2012     2013         Male       Male      sicker    sicker          102        105
4        222       2011     2012       Female     Female     healthy    sicker           87         81
5        222       2012     2013       Female     Female      sicker    sicker           81         83
structure(list(patient_id = c(111, 111, 111, 222, 222), year_start = c(2010,
2011, 2012, 2011, 2012), year_end = c(2011, 2012, 2013, 2012,
2013), gender_start = c("Male", "Male", "Male", "Female", "Female"
), gender_end = c("Male", "Male", "Male", "Female", "Female"),
state_start = c("healthy", "sick", "sicker", "healthy", "sicker"
), state_end = c("sick", "sicker", "sicker", "sicker", "sicker"
), weight_start = c(98, 97, 102, 87, 81), weight_end = c(97,
102, 105, 81, 83)), row.names = c(1L, 2L, 3L, 4L, 5L), class = "data.frame")
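For reference, one way to derive this transition format from `df` in base R. This is a sketch that assumes each patient's rows are consecutive yearly observations with no gaps; `transitions`, `start` and `end` are just illustrative names.

```r
df <- data.frame(
  patient_id = c(111, 111, 111, 111, 222, 222, 222),
  year = c(2010, 2011, 2012, 2013, 2011, 2012, 2013),
  gender = c("Male", "Male", "Male", "Male", "Female", "Female", "Female"),
  weight = c(98, 97, 102, 105, 87, 81, 83),
  state_at_year = c("healthy", "sick", "sicker", "sicker",
                    "healthy", "sicker", "sicker")
)

# Sort so that each patient's rows are adjacent and in time order
df <- df[order(df$patient_id, df$year), ]
n <- nrow(df)

# Pair every row with the next one, keeping only pairs from the same patient
same_patient <- df$patient_id[-n] == df$patient_id[-1]
start <- df[-n, ][same_patient, ]
end   <- df[-1, ][same_patient, ]

transitions <- data.frame(
  patient_id   = start$patient_id,
  year_start   = start$year,
  year_end     = end$year,
  gender_start = start$gender,
  gender_end   = end$gender,
  state_start  = start$state_at_year,
  state_end    = end$state_at_year,
  weight_start = start$weight,
  weight_end   = end$weight
)
print(transitions)
```

Each patient with m observations yields m - 1 transition rows, which is where the "lost" final row per patient comes from.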
It appears that there is no way around eliminating the last row of data for each patient, since no transition out of that row is observed. This means that we are forced to lose one row of data per patient.
If the patient experiences an absorbing event (e.g. death), this is not a problem. However, if the patient is "right censored" (i.e. has the event after the end of the study), there seems to be nothing we can do to account for the censoring other than removing the last row of data for that patient. We could try some imputation method, or assume that the patient remains in their current state, but this is risky. As such, it seems like there is no option but to discard the last available row of data (i.e. the weight_end value occurring at the last row) for each patient and keep only the complete transitions.
Is my understanding of this correct?
Best Answer
If there are any events (state transitions) in the data set during the last shared time interval, then the "last rows" of individuals who don't have the event during that interval should not be "discarded." They still contribute to the information for building the model and are not "discarded" by the modeling process, either.
Consider a simple 2-state alive/dead transition model. The discrete-time model for that last time interval is a binomial regression of a transition to death versus no transition. You need to keep track of the number of no-transition cases to evaluate the probability of a transition during that last time interval. The argument extends to multinomial regression for multi-state models.
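A small sketch of that point in R, using made-up counts for the last interval (3 deaths, 7 survivors; these numbers are hypothetical and `died`, `fit_keep`, `p_keep`, `p_drop` are illustrative names). An intercept-only binomial regression recovers the transition probability only if the no-transition cases are kept.

```r
# Hypothetical last interval: 1 = transition to death, 0 = no transition
died <- c(rep(1, 3), rep(0, 7))

# Intercept-only binomial (logistic) regression on all individuals at risk
fit_keep <- glm(died ~ 1, family = binomial)
p_keep <- unname(plogis(coef(fit_keep)))  # approximately 0.3, i.e. 3 / 10

# If the no-transition rows were thrown away, only events would remain,
# and the apparent transition probability would be 1
p_drop <- mean(died[died == 1])

print(c(keep_all = p_keep, events_only = p_drop))
```

The seven no-transition individuals form the denominator of the estimate, which is why they carry information even though nothing "happened" to them.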
If there are no events at all during that last time interval, then there is no estimated hazard of an event during that time interval. You might have a situation like that in the answer you cite in a comment. If each discrete-time interval is modeled with a separate intercept for a baseline hazard, then data from the last time interval will provide no information for event probabilities during that interval. Those last-interval data points might nevertheless contribute some information for models that fit the baseline hazard over time to a smoothed form. With a parametric model, such cases with a right-censored transition time provide a likelihood contribution proportional to the survival curve up to the right-censoring time.
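To spell out that last sentence for the simple two-state case, here is the standard form of the likelihood contributions. The notation is my own shorthand rather than anything from the question: $h_i(t)$ is individual $i$'s hazard in interval $t$, $t_i$ is the interval in which an event occurs, and $c_i$ is the last interval observed for a right-censored individual.

$$
L_i^{\text{event}} = h_i(t_i) \prod_{t < t_i} \bigl(1 - h_i(t)\bigr),
\qquad
L_i^{\text{censored}} = \prod_{t \le c_i} \bigl(1 - h_i(t)\bigr) = S_i(c_i)
$$

The censored contribution is not 1, so the model still learns from it: every interval survived without an event pushes the estimated hazards for those intervals downward.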