Survival – Fixed Effects Cox Proportional Hazards Model for Interval Censored Data in R

Tags: cox-model, interval-censoring, robust-standard-error, survival

I have a large data set where some observation-periods are right censored (no event observed), others are interval censored (event observed but timing is uncertain), and some events fall into the observation-period.

Specifically:

  • we have individuals with year of birth between 2000 and 2018
  • each individual can have at most one event. There is only one type of event.
  • we know the timing of events that occurred in the period 2009-2018
  • events that occurred before the year 2009 are all registered in 2008. For individuals born before 2009 who also had an event before 2009, we therefore only know that the event happened between the year of birth and 2008 (interval censoring).
  • we have no knowledge about events that occurred after 2018 (right censoring)

My understanding is that this data can be described as interval-censored data.

Another important characteristic of the data is that observations are clustered: Events are observed in strata (children in the same family), and the goal is to estimate the effect of within-strata variation of an exposure on occurrence of events with a fixed-effects model. To implement a fixed-effects cox model I need to use stratified baseline hazards.

The challenge is that I would like to fit a semi-parametric Cox proportional hazards regression (e.g. coxph from the survival package) with stratified baseline hazards, but AFAIK coxph does not work with survival objects that describe interval-censored data. (I could use the parametric models implemented in the survreg function, but I'd like to try the Cox model first.)

I'd be glad for any pointers to R packages that can estimate Cox models with interval-censored data while also estimating strata-specific baseline hazards and robust standard errors. (As far as I can tell, this is possible in the newest Stata 17.)

PS: There is the icenReg package, which supports Cox regressions for interval-censored data, but I could not find information about stratification and robust standard errors in the package documentation. I therefore assume that this package cannot deal with my data.

Best Answer

It seems that time = 0 for each individual is the birth date, so that the event time is age at the event. You don't seem to have time-varying covariates. With at most 1 event per individual, you don't need to take repeated events into account, although you do need to take correlations among members of the same family into account. In those respects, this is a straightforward survival model.

Left censoring

The practical problem with a Cox model here is that the times to events prior to 2009 are left censored. It's probably best to call them "left censored," even though left censoring is a limiting case of interval censoring in which the lower limit is the smallest possible value (here $t=0$, or $-\infty$ on the log-time scale). I think that the term "interval censored" is more typically used in situations where individuals are followed up at intervals (e.g., regular visits after cancer treatment) and the event time is known to lie between 2 follow-up visits. For a standard Cox model based on risk sets at event times, you don't know which risk sets should include the left-censored cases, so you need another approach to proportional-hazards modeling.
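As a rough sketch of how the three kinds of observation in the question could be encoded, the toy example below builds a Surv object of type "interval2" from the survival package, where a missing lower bound marks left censoring, a missing upper bound marks right censoring, and equal bounds mark an exactly observed time. The column names (birth_year, event_year) and the year-level age arithmetic are assumptions made for illustration, not taken from the actual data.

```r
library(survival)

# Hypothetical toy records; column names are illustrative only.
d <- data.frame(
  birth_year = c(2005, 2010, 2012, 2003),
  event_year = c(2008, 2015, NA, 2008)  # 2008 = "registered in 2008", i.e. some time before 2009
)
last_year <- 2018

early   <- !is.na(d$event_year) & d$event_year <= 2008  # timing unknown: left censored
noevent <- is.na(d$event_year)                          # no event by 2018: right censored

# Bounds on age at event in years (year-level resolution is an assumption).
d$lower <- ifelse(early, NA,
                  ifelse(noevent, last_year - d$birth_year, d$event_year - d$birth_year))
d$upper <- ifelse(early, 2008 - d$birth_year,
                  ifelse(noevent, NA, d$event_year - d$birth_year))

# NA lower bound = left censored, NA upper bound = right censored, equal bounds = exact.
y <- Surv(d$lower, d$upper, type = "interval2")
print(y)
```

A response built this way can be passed to survreg(), whereas coxph() does not accept it, which is the practical obstacle described above.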

Family clusters

In terms of the clustering, you say:

Events are observed in strata (children in the same family), and the goal is to estimate the effect of within-strata variation of an exposure on occurrence of events with a fixed-effects model. To implement a fixed-effects cox model I need to use stratified baseline hazards.

There's a big practical problem in stratifying with separate baselines for each family. That will provide only a few cases within each stratum, leading to difficulty in identifying family-specific baseline hazards and a big loss of power overall.

You might be better off with an unstratified model, using some other way to take within-family correlations into account. Standard approaches are to estimate robust standard errors around the coefficient point estimates (cluster() term in the survival package) or to include a frailty/random-effect term in the model for families.
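To make the options concrete, here is a minimal sketch on simulated, right-censored data (so that coxph() applies) comparing a cluster-robust fit, a frailty fit, and the family-stratified fit discussed earlier. The data-generating values (300 two-child families, a log hazard ratio of 0.5, censoring at t = 10) are arbitrary assumptions chosen only so the example runs.

```r
library(survival)

set.seed(42)
n_fam <- 300
fam <- rep(seq_len(n_fam), each = 2)            # two children per family
z   <- rep(rnorm(n_fam, sd = 0.5), each = 2)    # shared, unobserved family effect
x   <- rbinom(2 * n_fam, 1, 0.5)                # exposure that varies within families
t_event <- rexp(2 * n_fam, rate = 0.1 * exp(0.5 * x + z))

d <- data.frame(
  time   = pmin(t_event, 10),                   # administrative right censoring at t = 10
  status = as.integer(t_event <= 10),
  x = x, fam = fam
)

# Unstratified model with robust (sandwich) standard errors clustered on family.
fit_cluster <- coxph(Surv(time, status) ~ x + cluster(fam), data = d)
# Unstratified model with a family-level frailty (random effect).
fit_frailty <- coxph(Surv(time, status) ~ x + frailty(fam), data = d)
# Separate baseline hazard per family (the "fixed-effects" approach).
fit_strata  <- coxph(Surv(time, status) ~ x + strata(fam), data = d)

print(summary(fit_cluster)$coefficients)        # reports naive and robust SE side by side
print(c(cluster = unname(coef(fit_cluster)["x"]),
        frailty = unname(coef(fit_frailty)["x"]),
        strata  = unname(coef(fit_strata)["x"])))
print(c(se_cluster = sqrt(fit_cluster$var[1, 1]),
        se_strata  = sqrt(fit_strata$var[1, 1])))
```

Comparing the two standard errors at the end gives a feel for how much precision the family-specific baselines cost in a given data set.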

Parametric versus semi-parametric models

Left-censored data are reasonably easy to handle with parametric models, as the contribution of a case left-censored at time $t_c$ to the likelihood is simply the cumulative distribution through $t_c$, $F(t_c) = 1 - S(t_c)$, the complement of the survival function through that time. With a semi-parametric model and left/interval censoring, you instead need to estimate a baseline hazard (via the Turnbull extension to Kaplan-Meier curves for interval-censored data) jointly with the regression coefficients. That computationally intensive task furthermore requires bootstrapping to get confidence intervals on the coefficient estimates. See the icenReg package documentation.
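Written out for a parametric model with parameter vector $\theta$, and assuming independent, non-informative censoring, the likelihood combines three kinds of contributions. The index sets $E$ (exact event times), $R$ (right censored) and $L$ (left censored) are my own notation; $f$ is the density and $S$ the survival function:

$$
L(\theta) = \prod_{i \in E} f(t_i \mid \theta) \; \prod_{i \in R} S(t_i \mid \theta) \; \prod_{i \in L} \bigl[ 1 - S(t_{c,i} \mid \theta) \bigr]
$$

Each factor has a closed form once a distribution is chosen, which is why no risk sets are needed.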

Although I haven't tried this myself, the icenReg package seems capable of handling a semi-parametric proportional-hazards model along with clustering to account for within-family correlations. You would use id values to represent families rather than individuals. The package has an ir_clustBoot() function that seems to do the bootstrapping equivalent of the cluster() adjustment in survival package models, evidently with cluster-based bootstrapping instead of the default case-based bootstrapping. To test that, you could try a small test data set with only right censoring and compare the results with icenReg functions against what you get with the survival package and cluster() terms.
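A hedged sketch of what that might look like follows. It assumes, based on my reading of the icenReg documentation, that ic_sp() accepts a cbind(l, u) response with l = 0 for left censoring, u = Inf for right censoring and l == u for exact times, and that ir_clustBoot() updates the fitted object's covariance in place. The simulated data and all variable names are hypothetical, and the number of bootstrap samples is kept far too small for real use.

```r
library(icenReg)

set.seed(11)
n_fam <- 100
fam <- rep(seq_len(n_fam), each = 2)             # family identifier
z   <- rep(rnorm(n_fam, sd = 0.3), each = 2)     # shared family effect (simulation only)
x   <- rbinom(2 * n_fam, 1, 0.5)
age <- rexp(2 * n_fam, rate = 0.1 * exp(0.5 * x + z))  # true age at event

a <- sample(1:6, 2 * n_fam, replace = TRUE)      # age at which exact recording starts (hypothetical)
b <- a + 10                                      # age at end of follow-up

# Assumed icenReg coding: l = 0 -> left censored, u = Inf -> right censored, l == u -> exact.
d <- data.frame(
  l = ifelse(age < a, 0, pmin(age, b)),
  u = ifelse(age < a, a, ifelse(age > b, Inf, age)),
  x = x, fam = fam
)

# Semi-parametric proportional-hazards fit; no case-based bootstrap at this stage.
fit <- ic_sp(cbind(l, u) ~ x, data = d, model = "ph", bs_samples = 0)

# Cluster bootstrap resampling whole families; use many more samples in practice.
ir_clustBoot(fit, ID = d$fam, bs_samples = 50)
print(summary(fit))
```

Running the same comparison on data with only right censoring, against coxph() with a cluster() term, would be a way to check that the two approaches agree.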

It might be simplest to use the standard survreg() function in the survival package to try parametric models. Those models directly handle clustered data with a "cluster" argument provided to the function. Although survreg() also will accept a frailty term instead, the main documentation (Section 5.5.3) indicates that frailty terms might not behave properly outside of the Cox regression context.
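A minimal sketch of that parametric route, on simulated data with left-censored, exact and right-censored ages, is shown below. The Weibull choice, the simulation settings and the variable names are assumptions for illustration; the cluster argument is the one described in recent versions of the survival package.

```r
library(survival)

set.seed(7)
n_fam <- 300
fam <- rep(seq_len(n_fam), each = 2)             # family identifier
z   <- rep(rnorm(n_fam, sd = 0.3), each = 2)     # shared family effect (simulation only)
x   <- rbinom(2 * n_fam, 1, 0.5)
age <- rweibull(2 * n_fam, shape = 1.5, scale = 8 * exp(-0.4 * x + z))  # true age at event

a <- sample(1:6, 2 * n_fam, replace = TRUE)      # age at which exact recording starts (hypothetical)
b <- a + 10                                      # age at end of follow-up

# interval2 coding: NA lower = left censored, NA upper = right censored, equal = exact.
d <- data.frame(
  lower = ifelse(age < a, NA, pmin(age, b)),
  upper = ifelse(age < a, a, ifelse(age > b, NA, age)),
  x = x, fam = fam
)

# Weibull accelerated-failure-time model with family-clustered robust standard errors.
fit <- survreg(Surv(lower, upper, type = "interval2") ~ x,
               data = d, dist = "weibull", cluster = fam)
print(summary(fit))
```

Note that survreg() coefficients are on the log-time (accelerated failure time) scale, so their signs are reversed relative to log hazard ratios; for the Weibull, dividing the negated coefficient by the estimated scale gives the proportional-hazards equivalent.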
