I am working with data that contains information about the age at which each person enrolled in a study; zero or more hospitalization events for a disease (with age information); the age at which they should be censored; and (if relevant) the age of their death.
This seems to be fairly amenable to a multi-state model with R's survival package, because it yields estimates of the association between risk factors and the transition from health to (first) disease, as well as the risk factors for transitioning between disease and disease (i.e., recurrent disease). As an example of the model construction that yields this information (not runnable):
cm <- coxph(
Surv(time = interval_start_age_years,
time2 = interval_end_age_years,
event = status_end,
origin = enroll_age_years,
)
~
sex +
start_age_years +
cluster(sample_id),
id = sample_id,
method="breslow",
data=series
)
With this approach, which uses a data set that includes a final entry for each person that transitions them to the censored state (if they did not die), I get estimates for quantities that seem relevant:
- Transition from baseline state to disease
- Transition from disease to disease (which I interpret as re-hospitalization)
- Transition from baseline state to death
- Transition from disease to death.
Currently, my baseline/disease-free state is 0
, and my censoring state is 0
. So someone who gets hospitalized for disease (and then does not die, and is therefore censored at the end of their follow-up time) will transition from 0->1->0 (one entry for 0->1, and a second for 1->0). This final transition from 1->0 seems a bit odd. That said, no value shows up for this 1->0 transition in the coxph
result, so it seems to be handled as expected, i.e., as censoring.
Are my interpretations correct? Particularly with respect to: (1) the meaning of the 1->1 model estimates as recurrent events, and (2) how censoring is being handled in the multi-state model?
=== Edit: more info, as requested ===
The survreg transitions output is below, showing sensible (to me) transitions (e.g., Died is an absorbing state):
to
from Disease Died (censored)
(s0) 26289 26689 425990
Disease 72176 5414 20875
Died 0 0 0
Example data:
sample_id enroll_age_years start_age_years end_age_years status_start status_end_factor sex incident_number
A 63.97262 63.97262 75.43600 NoDisease Disease Male 0
A 63.97262 75.43600 75.49076 Disease Disease Male 1
A 63.97262 75.49076 77.29227 Disease NoDisease Male 2
B 56.72827 56.72827 66.35729 NoDisease Disease Female 0
B 56.72827 66.35729 67.70431 Disease Died Female 1
Model output:
Call:
coxph(formula = Surv(time = start_age_years, time2 = end_age_years,
event = status_end_factor, origin = enroll_age_years, ) ~
sex + start_age_years + strata(incident_number), data = series,
method = "breslow", id = sample_id, cluster = sample_id)
n= 577433, number of events= 130568
coef exp(coef) se(coef) robust se z Pr(>|z|)
sexMale_1:2 0.6844315 1.9826445 0.0127074 0.0127150 53.829 <2e-16
start_age_years_1:2 0.1111991 1.1176174 0.0010152 0.0010176 109.275 <2e-16
sexMale_2:2 0.0474327 1.0485756 0.0079050 0.0145284 3.265 0.0011
start_age_years_2:2 0.0449391 1.0459641 0.0007040 0.0013932 32.255 <2e-16
sexMale_1:3 0.4790696 1.6145715 0.0123618 0.0123406 38.821 <2e-16
start_age_years_1:3 0.0912255 1.0955160 0.0009407 0.0009581 95.215 <2e-16
sexMale_2:3 0.2773273 1.3195982 0.0293437 0.0315166 8.799 <2e-16
start_age_years_2:3 0.0926541 1.0970822 0.0028743 0.0033688 27.504 <2e-16
sexMale_1:2 ***
start_age_years_1:2 ***
sexMale_2:2 **
start_age_years_2:2 ***
sexMale_1:3 ***
start_age_years_1:3 ***
sexMale_2:3 ***
start_age_years_2:3 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
sexMale_1:2 1.983 0.5044 1.934 2.033
start_age_years_1:2 1.118 0.8948 1.115 1.120
sexMale_2:2 1.049 0.9537 1.019 1.079
start_age_years_2:2 1.046 0.9561 1.043 1.049
sexMale_1:3 1.615 0.6194 1.576 1.654
start_age_years_1:3 1.096 0.9128 1.093 1.098
sexMale_2:3 1.320 0.7578 1.241 1.404
start_age_years_2:3 1.097 0.9115 1.090 1.104
Concordance= 0.713 (se = 0.001 )
Likelihood ratio test= 37386 on 8 df, p=<2e-16
Wald test = 27042 on 8 df, p=<2e-16
Score (logrank) test = 34066 on 8 df, p=<2e-16, Robust = 24971 p=<2e-16
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
```
Best Answer
This seems to do what you aimed to do. A few thoughts.
One fine point is that the name "NoDisease" you gave to the reference level of your
status_end_factor
was irrelevant. Unless you specify otherwise with anistate
argument, the software assigns all individuals to a starting state(s0)
as state1
in the model. That's different from the censoring indicator, which is always taken as the reference level of an event indicator that is a factor. Having a factor level called "NoDisease" as your reference level forstatus_end_factor
doesn't mean a transition from a "Disease" to a true "NoDisease" state as you seem to fear. The software just treats a row with the reference level for the event indicator as a right-censored stop time (regardless of that level's name) for whatever state the individual started in.The transitions all seem to make sense. You have transitions from
(s0)
to each of "Disease" (transitions 1:2) and "Death" (transition 1:3), a transition from "Disease" to "Death" (transition 2:3), and transitions from "Disease" back to "Disease" (transition 2:2). The matrix of transition numbers is internally consistent. You've modeled separate covariate coefficients for each of the transitions. Stratifying on the number of hospitalizations provides a start on accounting for that issue.I don't have experience with the
origin
argument toSurv()
so you should check on a small data set that it does what you intend, which seems to be to set thetime = 0
reference for all cases to be the age at study entry. Then you should also double-check whether that choice makes sense in terms of the goals of your study. Selecting thetime = 0
reference can be tricky.With this large a data set you might consider something other than a linear association of
start_age_years
with transitions. Modeling with splines could provide much more flexible and realistic estimates of the associations between age and transitions.Two thoughts about time-varying covariates.
First, with a linear association between
start_age_years
and log-hazards, a time-dependentage
variable doesn't add anything. If you modelstart_age_years
with splines as I suggest, it will. See Section 5 of the time-dependence vignette.Second, there's no reason in principle why you can't use this same counting-process format to handle both the covariates and the transitions. You might be able to get around the problem with time-varying covariates that you report in a comment by taking advantage of the
istate
argument tocoxph()
, which specifies "the current state at the start each interval" according to the manual page. I don't have experience with that, however.