Solved – Using R to calculate survival probabilities with time-varying covariates using an Andersen-Gill model

rrecurrent-eventssurvival

I have been using the cph function of the rms package in R to fit an Andersen-Gill (AG) model for recurrent time to events. I include time-varying covariates in this model as per the 1982 paper from Andersen and Gill – for example I use the "dynamic" covariate recurrent outcome history to model the within-subject dependence in the recurrent events.

Now I wish to calculate survival probabilities for each of the unique observed recurrent event times (unique observed times from all subjects) for each subject. Of particular concern is that I am not using the "id" keyword and so the "cph" function "naively" believes that each row of the dataset is a separate subject. My understanding is that this approach is correct for estimation but I am uncertain if this is correct for survival probabilities.

When I say above that "My understanding is that this approach is correct for estimation" I mean with a "full" set of covariates that explain the within subject dependence, then conditional on these covariates the repeated events within a subject are independent. If some covariates are missing then this does not hold and some dependence will go unexplained. In this situation naively using the "cph" method leads to unbiased estimates but requires the "cluster" command to adjust the standard errors.

My R code to fit the model is as follows where cov1-cov3 are my covariates and subjid is my subject index;

AG_mod = cph(Surv(tstart,tstop,censored)~cov1+cov2+cov3+cluster(subjid),x=TRUE,y=TRUE,data=mydata,surv=TRUE,singular.ok=TRUE)

Then (in pseudo code) I do the following to calculate the survival probabilities;

for each subject

    datai = data for subject i

    surv_fit_i = survfit(AG_mod,newdata=dati,id=subjid,conf.type="none",se.fit=FALSE);   

end

Later update to pseudo code (the above code produces errors)

for each subject
       #get the data for ith subject
       datai = data for subject i

       #get the time intervals as a survival object
       intervals = Surv(dati$tstart,dati$tstop,dati$censored)

        #get the covariate data
        covsi =dati[,c("subjid","cov1","cov2","cov3"),with=FALSE]            

        #combine intervals with covariate data
        covsi = data.frame(covsi,intervals)

        #estimate the survival curve for this subject
        surv_fit = survfit(AG_mod,newdata=covsi,conf.type="none",se.fit=FALSE,id=covsi$subjid,censor=TRUE)           

    end

As far as I can tell the "id=covsi$subjid" statement tells R that multiple records belong to the same subject and so I am guessing that equation (2) below is implemented?

Later update to clarify model

The model I am fitting is an "intensity" model as per the 1982 Andersen-Gill definition

$E[dN_{i}(t)|Y_{i}(t),\mathcal{F}_{t^{-}}]=Y_{i}(t)\alpha_{0}(t)\exp(\beta N_{i}(t^{-}))$

where the history $\mathcal{F}_{t^{-}}$ contains all information about the counting process $N_{i}(t)$ up to "just before" time $t$, and this includes information on fixed and time-varying covariates (including the "dynamic" covariates time since last outcome and number of previous outcomes), and previous censoring events. This intensity model assumes the observed "jumps" in the counting process for subject $i$ at times $t_{1},t_{2},…$, $N_{i}(t_{1}),N_{i}(t_{2}),…$ are conditionally independent given this history $\mathcal{F}_{t^{-}}$. Furthermore since $\mathcal{F}_{t^{-}}$ contains the censoring history then unbiased estimators of $\beta$ can be obtained even if the censoring is informative – all that is required is a weaker assumption called Coarsening at Random (CAR) which requires at any time $t$ that censoring depends only on the observed data (and this could include dependence on the observed outcome history).

In an intensity model the dependence within subjects is modelled with the dynamic covariates so that the quantity $N_{i}(s)-\int_{0}^{t}Y_{i}(s)\alpha_{0}(s)\exp(\beta N_{i}(s^{-}))ds$ is a martingale. The way I think of this is there should be no subject-specific random variation in the residuals since this is captured by the dynamic covariates. In contrast omitting some or all of the dynamic covariates means the quantity $N_{i}(s)-\int_{0}^{t}Y_{i}(s)\alpha_{0}(s)\exp(\beta N_{i}(s^{-}))ds$ in general is not a martingale – this is called a "rate" rather than an intensity model and was descibed by Lin et al 2000. Rate models have unbiased estimators of $\beta$ but require an empirical estimator for the estimator variance. The way I think about this is there will be some unexplained subject-specific variation in the residuals from rate models because we "naively" estimate $\beta$ assuming conditional independence.

If an intensity process does not depend in some way on the counting process history then by definition of the intensity of a counting process the recurrent events must also be independent. In particular if the counting process is a Poisson process the recurrent events must also be independent.

Update regarding survival probabilities

An estimator of the survival probability $S(t|x_{i})$ for covariate vector $x_{i}$ is

\begin{align}
\hat{S}(t|x_{i})=\prod_{T_{j}\leq t}\{1-\Delta\hat{A}(T_{j}|x_{i})\},
\end{align}

where $\Delta\hat{A}(T_{j}|x_{i})=1/Y(T_{j})$ is an estimator of the "increment" of the cumulative hazard $\hat{A}(t|x_{i})$. For a fixed covariate vector $x_{i}(s')$, for a point $s'\in(0,t]$, it is sometimes not meaningful to calculate the cumulative hazard in the time interval $[0,t]$, but nonetheless this can be accomplished by

\begin{align}
\hat{A}(t|x_{i})=\exp(\hat{\beta} x_{i}(s'))\int_{0}^{t}d\hat{A}_{0}(u),\hspace{200pt}(1)
\end{align}

where

\begin{align*}
\hat{A}_{0}(t)&=\int_{0}^{t}\hat{\alpha}_{0}(u)du\\
&=\sum_{T_{j}\leq t}\frac{1}{\sum\nolimits_{l\in\mathcal{R}_{j}}\exp(\hat{\beta} x_{l}(T_{j}))},
\end{align*}

where $\mathcal{R}_{j}:=\{l:Y_{l}(T_{j})=1\}$ is the risk set of subject indices at observed event time $T_{j}$.

Alternatively the cumulative hazard corresponding to the whole treatment path $x_{i}(s)$ for $0<s\leq t$ can make more sense and can be calculated as

\begin{align}
\hat{A}(t|x_{i})=\int_{0}^{t}\exp(\hat{\beta} x_{i}(u))d\hat{A}_{0}(u),\hspace{200pt}(2)
\end{align}

It is equation (2) that I believe makes most sense to calculate at any given $t$, but what I am wondering is exactly how is this performed in R?

Best Answer

My line of thought goes like this. As per the AG model, each individual is represented by a counting process $N_i(t)$ with intensity $\lambda_i(t)$ which can be written as $\lambda_i(t) = \lambda_0(t) \exp(\beta' x_i(t))$. By this it is implicit that only the "current" value of $x_i$ matters for the intensity. The counting process grows when an individual has an event. I also assume independent censoring (actually stopping of the process). Note that these assumptions are implicit from the way that you fitted the model.

The probability of no events in $(a,b)$ (improperly, a "survival") is then $$ S_i(t|s) = \exp \left( -\int_s^t \lambda_0(u) \exp(\beta'x_i(u)) du \right) $$ Note that the probability of no events does not directly relate to the probability of one event, as the complementary event to "no events in a period" is "at least one event during a period".

Then say you are interested in $S(t|s_0)$. Then the idea would be to fix $x_i = x_i(s_0)$ (i.e. assume the covariates don't change after $s_0$), and we would get $$ S_i(t|s_0) = S_i(t) / S_i(s_0) $$

Since the part before $s_0$ cancels out, together with my assumption that only the current value of $x_i$ matters for the intensity, means that this is equal to $$ S_i(t|s_i) = \exp \left( -\exp(\beta'x_i(s_0)) \int_{s_0}^t \lambda_0(u)du \right) $$

Then in R you can do it like that (I give an example with a data set):

library(frailtypack)
data(readmission)

mod1 <- coxph(Surv(t.start, t.stop, event) ~ sex + charlson + cluster(id), data = readmission)

# set the covariate values at s_0
mycov <- data.frame(sex = "Female", charlson = "1-2")
sf <- survfit(mod1, newdata = mycov)

# with different s_0
par(mfrow=c(2,2))
time <- c(300, 500, 700, 1000)
for(i in time) {
  pos <- which.max(sf$time[sf$time <= i])
  S_s0 <- sf$surv[pos]

  with(sf, plot(time[pos:length(time)], surv[pos:length(surv)] / S_s0), type = "l")
}

Here you get the plots of the survival curves, which correspond to the values $(t, S(t|s_0))$ for $t\geq s_0$.

The other comments that I would give on this are the following: it is difficult to talk about the distribution of the next event. This is because in the AG formulation the time since previous event does not play any role. In other words, if you would like to take that into account, more complicated stochastic models should be used, where for example you include the "previous number of events" as a time-dependent covariate. This does complicate things a lot and the interpretation of the estimated quantities is most likely very difficult.

The second comment I have is about the nature of the time dependent covariates. Mostly, the AG works nicely with "external" covariates, such as air pollution, or something which is not directly measured on the subject ("external" of the recurrent event process). This is mostly because the first expression I wrote here is the probability of no events during $(a,b)$ relies on the assumption that the number of events in any given interval is Poisson distributed. This is true if the covariates are external. A discussion on this can be found in several textbooks, for example in Cook & Lawless at section 2.5. If your time-dependent covariates do depend on the recurrent event process, then it should be modeled jointly with the recurrent events process.

Related Question