Solved – Using random forest for survival analysis with time varying covariates

iidnon-independentrandom forestsurvival

I've been trying to train a model that predicts an individual's survival time.

My training set is an unbalanced panel; it has multiple observations per individual and thus time varying covariates. Every individual is observed from start to finish so no censoring.

As a test, I used a plain random forest regression (not a random survival forest), treating each observation as if it were iid (even if it came from the same individual) with the duration as the target. When testing the predictions on a test set, the results have been surprisingly accurate.

Why is this working so well? I thought random forests needed iid observations.

Best Answer

Although there is structure to your data, it may be that the variation in baseline risk does not vary substantially enough among your subjects to cause a model without a frailty term to form poor predictions. Of course, it's perfectly possible that a model with a frailty term would perform better than the pooled random forest model.

Even if you did run a pooled and hierarchical model and the pooled model did as well or slightly better, you may still want to use the hierarchical model because the variance in baseline risk is very likely NOT zero among your subjects, and the hierarchical model would probably perform better in the long term on data that was in neither your test or training sets.

As an aside, consider whether the cross validation score you are using aligns with the goals of your prediction task in the first place before comparing pooled and hierarchical models. If your goal is to make predictions on the same group of individuals as in your test/training data, then simple k fold or loo cross validation on the response is sufficient. But if you want to make predictions about new individuals, you should instead do k fold cross validation that samples at the individual level. In the first case you are scoring your predictions without regard for the structure of the data. In the second case you are estimating your ability to predict risk within individuals that are not in your sample.

Lastly, remember always that CV is itself data dependent, and only an estimate of your model's predictive capabilities.

Related Solutions

Solved – Using R to calculate survival probabilities with time-varying covariates using an Andersen-Gill model

My line of thought goes like this. As per the AG model, each individual is represented by a counting process $N_i(t)$ with intensity $\lambda_i(t)$ which can be written as $\lambda_i(t) = \lambda_0(t) \exp(\beta' x_i(t))$. By this it is implicit that only the "current" value of $x_i$ matters for the intensity. The counting process grows when an individual has an event. I also assume independent censoring (actually stopping of the process). Note that these assumptions are implicit from the way that you fitted the model.

The probability of no events in $(a,b)$ (improperly, a "survival") is then $$ S_i(t|s) = \exp \left( -\int_s^t \lambda_0(u) \exp(\beta'x_i(u)) du \right) $$ Note that the probability of no events does not directly relate to the probability of one event, as the complementary event to "no events in a period" is "at least one event during a period".

Then say you are interested in $S(t|s_0)$. Then the idea would be to fix $x_i = x_i(s_0)$ (i.e. assume the covariates don't change after $s_0$), and we would get $$ S_i(t|s_0) = S_i(t) / S_i(s_0) $$

Since the part before $s_0$ cancels out, together with my assumption that only the current value of $x_i$ matters for the intensity, means that this is equal to $$ S_i(t|s_i) = \exp \left( -\exp(\beta'x_i(s_0)) \int_{s_0}^t \lambda_0(u)du \right) $$

Then in R you can do it like that (I give an example with a data set):

library(frailtypack)
data(readmission)

mod1 <- coxph(Surv(t.start, t.stop, event) ~ sex + charlson + cluster(id), data = readmission)

# set the covariate values at s_0
mycov <- data.frame(sex = "Female", charlson = "1-2")
sf <- survfit(mod1, newdata = mycov)

# with different s_0
par(mfrow=c(2,2))
time <- c(300, 500, 700, 1000)
for(i in time) {
  pos <- which.max(sf$time[sf$time <= i])
  S_s0 <- sf$surv[pos]

  with(sf, plot(time[pos:length(time)], surv[pos:length(surv)] / S_s0), type = "l")
}

Here you get the plots of the survival curves, which correspond to the values $(t, S(t|s_0))$ for $t\geq s_0$.

The other comments that I would give on this are the following: it is difficult to talk about the distribution of the next event. This is because in the AG formulation the time since previous event does not play any role. In other words, if you would like to take that into account, more complicated stochastic models should be used, where for example you include the "previous number of events" as a time-dependent covariate. This does complicate things a lot and the interpretation of the estimated quantities is most likely very difficult.

The second comment I have is about the nature of the time dependent covariates. Mostly, the AG works nicely with "external" covariates, such as air pollution, or something which is not directly measured on the subject ("external" of the recurrent event process). This is mostly because the first expression I wrote here is the probability of no events during $(a,b)$ relies on the assumption that the number of events in any given interval is Poisson distributed. This is true if the covariates are external. A discussion on this can be found in several textbooks, for example in Cook & Lawless at section 2.5. If your time-dependent covariates do depend on the recurrent event process, then it should be modeled jointly with the recurrent events process.

Solved – How to use survival model to predict when there are time dependent continuous covariates

I start with answering question 2.

In the pec package there is a function called predictSurvProb which will do exactly what you want. This function requires a coxph object (your fit2), a new data frame (your pbc.test) and lastly the point(s) in time of which you want the survival probability. Note that i said point(S), you can also give c(1,2,3,4) as argument to get the survival probability at time points 1, 2, 3 and 4. It gives the survival probability for each individual at the given time points.

I think question 1. has the same answer as question 2. But use the training set (pbc.train) as your new data frame. So do predictSurvProb(fit2, pbc.train, 10000) and this will deliver the desired results. I am not entirely sure if this function can handle points in time not yet happened.

Your side question. If the persons are already dead, they do not provide any added information (not sure), so I think it is safe to delete them. However, you can also run the code with and without them and check the results. If they differ, I am inclined to think they still provide some information and then it would be inappropriate to delete them.

Best Answer

Related Solutions

Solved – Using R to calculate survival probabilities with time-varying covariates using an Andersen-Gill model

Solved – How to use survival model to predict when there are time dependent continuous covariates

Related Question