Cox Model – Confirming Censoring State Transitions with Multi-State Cox Model

cox-modelmulti-staterrecurrent-eventssurvival

I am working with data that contains information about the age at which each person enrolled in a study; zero or more hospitalization events for a disease (with age information); the age at which they should be censored; and (if relevant) the age of their death.

This seems to be fairly amenable to a multi-state model with R's survival package, because it yields estimates of the association between risk factors and the transition from health to (first) disease, as well as the risk factors for transitioning between disease and disease (i.e., recurrent disease). As an example of the model construction that yields this information (not runnable):


cm <- coxph(
    Surv(time = interval_start_age_years,
         time2 = interval_end_age_years, 
         event = status_end,
         origin = enroll_age_years,
        ) 
    ~
    sex +
    start_age_years + 
    cluster(sample_id),
    id = sample_id,
    method="breslow", 
    data=series
)

With this approach, which uses a data set that includes a final entry for each person that transitions them to the censored state (if they did not die), I get estimates for quantities that seem relevant:

Transition from baseline state to disease
Transition from disease to disease (which I interpret as re-hospitalization)
Transition from baseline state to death
Transition from disease to death.

Currently, my baseline/disease-free state is 0, and my censoring state is 0. So someone who gets hospitalized for disease (and then does not die, and is therefore censored at the end of their follow-up time) will transition from 0->1->0 (one entry for 0->1, and a second for 1->0). This final transition from 1->0 seems a bit odd. That said, no value shows up for this 1->0 transition in the coxph result, so it seems to be handled as expected, i.e., as censoring.

Are my interpretations correct? Particularly with respect to: (1) the meaning of the 1->1 model estimates as recurrent events, and (2) how censoring is being handled in the multi-state model?

=== Edit: more info, as requested ===

The survreg transitions output is below, showing sensible (to me) transitions (e.g., Died is an absorbing state):

         to
from      Disease   Died (censored)
  (s0)      26289  26689     425990
  Disease   72176   5414      20875
  Died          0      0          0

Example data:

sample_id   enroll_age_years    start_age_years end_age_years   status_start    status_end_factor   sex incident_number
A   63.97262    63.97262    75.43600    NoDisease   Disease Male    0
A   63.97262    75.43600    75.49076    Disease Disease Male    1
A   63.97262    75.49076    77.29227    Disease NoDisease   Male    2
B   56.72827    56.72827    66.35729    NoDisease   Disease Female  0
B   56.72827    66.35729    67.70431    Disease Died    Female  1

Model output:

Call:
coxph(formula = Surv(time = start_age_years, time2 = end_age_years, 
    event = status_end_factor, origin = enroll_age_years, ) ~ 
    sex + start_age_years + strata(incident_number), data = series, 
    method = "breslow", id = sample_id, cluster = sample_id)

  n= 577433, number of events= 130568 

                         coef exp(coef)  se(coef) robust se       z Pr(>|z|)
sexMale_1:2         0.6844315 1.9826445 0.0127074 0.0127150  53.829   <2e-16
start_age_years_1:2 0.1111991 1.1176174 0.0010152 0.0010176 109.275   <2e-16
sexMale_2:2         0.0474327 1.0485756 0.0079050 0.0145284   3.265   0.0011
start_age_years_2:2 0.0449391 1.0459641 0.0007040 0.0013932  32.255   <2e-16
sexMale_1:3         0.4790696 1.6145715 0.0123618 0.0123406  38.821   <2e-16
start_age_years_1:3 0.0912255 1.0955160 0.0009407 0.0009581  95.215   <2e-16
sexMale_2:3         0.2773273 1.3195982 0.0293437 0.0315166   8.799   <2e-16
start_age_years_2:3 0.0926541 1.0970822 0.0028743 0.0033688  27.504   <2e-16
                       
sexMale_1:2         ***
start_age_years_1:2 ***
sexMale_2:2         ** 
start_age_years_2:2 ***
sexMale_1:3         ***
start_age_years_1:3 ***
sexMale_2:3         ***
start_age_years_2:3 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                    exp(coef) exp(-coef) lower .95 upper .95
sexMale_1:2             1.983     0.5044     1.934     2.033
start_age_years_1:2     1.118     0.8948     1.115     1.120
sexMale_2:2             1.049     0.9537     1.019     1.079
start_age_years_2:2     1.046     0.9561     1.043     1.049
sexMale_1:3             1.615     0.6194     1.576     1.654
start_age_years_1:3     1.096     0.9128     1.093     1.098
sexMale_2:3             1.320     0.7578     1.241     1.404
start_age_years_2:3     1.097     0.9115     1.090     1.104

Concordance= 0.713  (se = 0.001 )
Likelihood ratio test= 37386  on 8 df,   p=<2e-16
Wald test            = 27042  on 8 df,   p=<2e-16
Score (logrank) test = 34066  on 8 df,   p=<2e-16,   Robust = 24971  p=<2e-16

  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).
```

Best Answer

This seems to do what you aimed to do. A few thoughts.

One fine point is that the name "NoDisease" you gave to the reference level of your status_end_factor was irrelevant. Unless you specify otherwise with an istate argument, the software assigns all individuals to a starting state (s0) as state 1 in the model. That's different from the censoring indicator, which is always taken as the reference level of an event indicator that is a factor. Having a factor level called "NoDisease" as your reference level for status_end_factor doesn't mean a transition from a "Disease" to a true "NoDisease" state as you seem to fear. The software just treats a row with the reference level for the event indicator as a right-censored stop time (regardless of that level's name) for whatever state the individual started in.

The transitions all seem to make sense. You have transitions from (s0) to each of "Disease" (transitions 1:2) and "Death" (transition 1:3), a transition from "Disease" to "Death" (transition 2:3), and transitions from "Disease" back to "Disease" (transition 2:2). The matrix of transition numbers is internally consistent. You've modeled separate covariate coefficients for each of the transitions. Stratifying on the number of hospitalizations provides a start on accounting for that issue.

I don't have experience with the origin argument to Surv() so you should check on a small data set that it does what you intend, which seems to be to set the time = 0 reference for all cases to be the age at study entry. Then you should also double-check whether that choice makes sense in terms of the goals of your study. Selecting the time = 0 reference can be tricky.

With this large a data set you might consider something other than a linear association of start_age_years with transitions. Modeling with splines could provide much more flexible and realistic estimates of the associations between age and transitions.

Two thoughts about time-varying covariates.

First, with a linear association between start_age_years and log-hazards, a time-dependent age variable doesn't add anything. If you model start_age_years with splines as I suggest, it will. See Section 5 of the time-dependence vignette.

Second, there's no reason in principle why you can't use this same counting-process format to handle both the covariates and the transitions. You might be able to get around the problem with time-varying covariates that you report in a comment by taking advantage of the istate argument to coxph(), which specifies "the current state at the start each interval" according to the manual page. I don't have experience with that, however.

Related Solutions

Solved – Modelling mortality rates using Poisson regression

Without seeing the dataset (not available) it seems mostly correct. The nice thing about Poisson regressions is that they can provide rates when used as suggested. One thing that may be worth to keep in mind is that there may be overdispersion where you should switch to a negative binomial regression (see the MASS package).

The Poisson regression doesn't care whether the data as aggregated or not, but in practice non-aggregated data is frail and can cause some unexpected errors. Note that you cannot have surv == 0 for any of the cases. When I've tested the estimates are the same:

set.seed(1)
n <- 1500
data <- 
  data.frame(
    dead = sample(0:1, n, replace = TRUE, prob = c(.9, .1)),
    surv = ceiling(exp(runif(100))*365),
    gender = sample(c("Male", "Female"), n, replace = TRUE),
    diagnosis = sample(0:1, n, replace = TRUE),
    age = sample(60:80, n, replace = TRUE),
    inclusion_year = sample(1998:2011, n, replace = TRUE)
  )

library(dplyr)
model <- 
  data %>% 
  group_by(gender, 
           diagnosis,
           age,
           inclusion_year) %>% 
  summarise(Deaths = sum(dead),
            Person_time = sum(surv)) %>%
  glm(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) + offset(log(Person_time/10^3/365.25)), 
      data = . , family = poisson)

alt_model <- glm(dead ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) + offset(log(surv/10^3/365.25)), 
    data = data , family = poisson)
sum(coef(alt_model) - coef(model))
# > 1.779132e-14
sum(abs(confint(alt_model) - confint(model)))
# > 6.013114e-11

As you get a rate it is important to center the variables so that the intercept is interpretable, e.g.:

> exp(coef(model)["(Intercept)"])
(Intercept) 
    51.3771

Can be interpreted as the base rate and then the covariates are rate ratios. If we want the base rate after 10 years:

> exp(coef(model)["(Intercept)"] + coef(model)["I(inclusion_year - 1998)"]*10)
(Intercept) 
     47.427

I've currently modeled the inclusion year as a trend variable but you should probably check for nonlinearities and sometimes it is useful to do a categorization of the time points. I used this approach in this article:

D. Gordon, P. Gillgren, S. Eloranta, H. Olsson, M. Gordon, J. Hansson, and K. E. Smedby, “Time trends in incidence of cutaneous melanoma by detailed anatomical location and patterns of ultraviolet radiation exposure: a retrospective population-based study,” Melanoma Res., vol. 25, no. 4, pp. 348–356, Aug. 2015.

r – Multi-State Cox Models: Different Formulas for Various Transitions Explored

This is not a direct solution to your issue, but you could try using the mstate package.

After transforming data in long format with msprep, using expcovs you can create and add to the dataset transition-specific covariates, which you can then separately include in your model.

Considering the 4 transitions reported above, your Cox model should look something like this: coxph(Surv(Tstart, Tstop, status) ~ cov.1 + cov.2 + cov.3 + cov.4 + dayslasthosp.2 + dayslasthosp.4 + strata(trans), data=msdata), where .j identifies covariates specific to transition j. The covariate 'days since last hospitalization' (dayslasthosp) is included only for transitions 2 and 4.

The vignettes and tutorials available with the package are very informative, and guide you step-by-step through cases like this.

Best Answer

Related Solutions

Solved – Modelling mortality rates using Poisson regression

r – Multi-State Cox Models: Different Formulas for Various Transitions Explored

Related Question