Using random forest for survival analysis with time-varying covariates

Tags: iid, non-independent, random forest, survival

I've been trying to train a model that predicts an individual's survival time.

My training set is an unbalanced panel: it has multiple observations per individual and therefore time-varying covariates. Every individual is observed from start to finish, so there is no censoring.

As a test, I used a plain random forest regression (not a random survival forest), treating each observation as if it were iid (even if it came from the same individual) with the duration as the target. When testing the predictions on a test set, the results have been surprisingly accurate.

Why is this working so well? I thought random forests needed iid observations.

Best Answer

Although your data are structured (repeated observations per individual), baseline risk may not vary enough among your subjects for a model without a frailty term to predict poorly. Of course, it is entirely possible that a model with a frailty term would outperform the pooled random forest model.

Even if you fit both a pooled and a hierarchical model and the pooled model did as well or slightly better, you may still want to use the hierarchical model: the variance in baseline risk among your subjects is very likely not zero, so the hierarchical model would probably perform better in the long run on data that appeared in neither your test nor your training set.

As an aside, before comparing pooled and hierarchical models, consider whether the cross-validation score you are using matches the goal of your prediction task. If your goal is to make predictions for the same individuals that appear in your test/training data, then simple k-fold or leave-one-out (LOO) cross-validation over observations is sufficient. But if you want to make predictions about new individuals, you should instead use k-fold cross-validation that samples at the individual level, so that all observations from a given individual fall in the same fold. In the first case you are scoring your predictions without regard for the structure of the data. In the second case you are estimating your ability to predict risk for individuals who are not in your sample.
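To make the individual-level split concrete, here is a minimal sketch in plain Python. The function name `group_k_fold`, the `subject_ids` list, and the round-robin fold assignment are all illustrative assumptions, not something taken from the question's data or from any particular library.

```python
import random


def group_k_fold(groups, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no individual appears on both sides.

    `groups` holds one individual id per observation (hypothetical input format).
    """
    unique_ids = sorted(set(groups))
    if n_splits > len(unique_ids):
        raise ValueError("n_splits cannot exceed the number of individuals")

    # Shuffle individuals (not observations), then deal them round-robin into folds.
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    fold_of = {gid: i % n_splits for i, gid in enumerate(unique_ids)}

    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, test_idx


# Made-up unbalanced panel: one entry per observation, value = individual id.
subject_ids = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "d", "e", "e"]

for train_idx, test_idx in group_k_fold(subject_ids, n_splits=3):
    train_subjects = {subject_ids[i] for i in train_idx}
    test_subjects = {subject_ids[i] for i in test_idx}
    # Every individual is entirely in train or entirely in test for this fold.
    assert train_subjects.isdisjoint(test_subjects)
```

If you are working in scikit-learn, `GroupKFold` in `sklearn.model_selection` produces this kind of split when you pass the individual ids as `groups`, so you likely do not need to write your own. Comparing the score from this scheme with the score from ordinary k-fold on the same model gives a rough sense of how much of the apparent accuracy comes from having seen the same individuals during training.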

Lastly, always remember that CV is itself data dependent and gives only an estimate of your model's predictive capabilities.
