My line of thought goes like this. In the AG model, each individual is represented by a counting process $N_i(t)$, which jumps by one each time the individual has an event, with intensity $\lambda_i(t) = \lambda_0(t) \exp(\beta' x_i(t))$. This makes it implicit that only the "current" value of $x_i$ matters for the intensity. I also assume independent censoring (more precisely, independent stopping of the process). Note that these assumptions are already implicit in the way you fitted the model.
The probability of no events in $(s,t)$ (improperly called a "survival" probability) is then
$$
S_i(t|s) = \exp \left( -\int_s^t \lambda_0(u) \exp(\beta'x_i(u)) du \right)
$$
Note that the probability of no events does not directly relate to the probability of exactly one event: the complement of "no events in a period" is "at least one event during that period".
Now suppose you are interested in $S(t|s_0)$. The idea is to fix $x_i = x_i(s_0)$ (i.e. assume the covariates do not change after $s_0$), which gives
$$
S_i(t|s_0) = S_i(t) / S_i(s_0)
$$
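Writing the ratio out explicitly (with $S_i(t)$ shorthand for $S_i(t|0)$), the part of the integral before $s_0$ cancels:
$$
\frac{S_i(t)}{S_i(s_0)} = \exp \left( -\int_0^t \lambda_0(u) \exp(\beta'x_i(u))\, du + \int_0^{s_0} \lambda_0(u) \exp(\beta'x_i(u))\, du \right) = \exp \left( -\int_{s_0}^t \lambda_0(u) \exp(\beta'x_i(u))\, du \right)
$$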
The part before $s_0$ cancels out, and together with my assumption that only the current value of $x_i$ matters for the intensity (so that we may plug in $x_i(u) = x_i(s_0)$ for $u > s_0$), this gives
$$
S_i(t|s_0) = \exp \left( -\exp(\beta'x_i(s_0)) \int_{s_0}^t \lambda_0(u)\,du \right)
$$
In R you can do it like this (I illustrate with a data set):
library(survival)
library(frailtypack)  # provides the readmission data set
data(readmission)
# Andersen-Gill model with robust (cluster) variance
mod1 <- coxph(Surv(t.start, t.stop, event) ~ sex + charlson + cluster(id),
              data = readmission)
# set the covariate values at s_0
mycov <- data.frame(sex = "Female", charlson = "1-2")
sf <- survfit(mod1, newdata = mycov)
# conditional survival curves for different choices of s_0
par(mfrow = c(2, 2))
s0_values <- c(300, 500, 700, 1000)
for (s0 in s0_values) {
  pos <- max(which(sf$time <= s0))  # index of the last event time at or before s_0
  S_s0 <- sf$surv[pos]              # estimated S(s_0)
  with(sf, plot(time[pos:length(time)], surv[pos:length(surv)] / S_s0,
                type = "l", xlab = "t", ylab = "S(t | s0)"))
}
This produces plots of the conditional survival curves, i.e. the pairs $(t, S(t|s_0))$ for $t \geq s_0$.
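Equivalently, you could evaluate the last formula directly from the estimated cumulative baseline hazard. A minimal sketch, assuming the coefficient names below match the default treatment contrasts for the readmission data (check names(coef(mod1)) yourself; the objects H0, lp, H and s0 are my own):
H0 <- basehaz(mod1, centered = FALSE)    # Breslow estimate of H_0(t) at covariates = 0
lp <- sum(coef(mod1)[c("sexFemale", "charlson1-2")])  # beta'x_i(s_0) for mycov
H  <- stepfun(H0$time, c(0, H0$hazard))  # H_0(.) as a step function
s0 <- 300
tt <- H0$time[H0$time >= s0]
S_cond <- exp(-exp(lp) * (H(tt) - H(s0)))  # exp(-exp(beta'x) [H_0(t) - H_0(s_0)])
plot(tt, S_cond, type = "l", xlab = "t", ylab = "S(t | s0)")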
The other comment I would make is this: it is difficult to talk about the distribution of the next event, because in the AG formulation the time since the previous event plays no role. If you want to take that into account, more complicated stochastic models are needed, where for example you include the previous number of events as a time-dependent covariate (see the sketch below). This complicates things a lot, and the interpretation of the estimated quantities is most likely very difficult.
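To make that remark concrete, here is a hedged sketch of what such an extension could look like on the readmission data, with the previous number of events as a time-dependent covariate in the (start, stop] format. The column name nprev is my own, and the interpretation caveat above stands:
# number of events strictly before the current interval, per individual
readmission2 <- readmission[order(readmission$id, readmission$t.start), ]
readmission2$nprev <- ave(readmission2$event, readmission2$id,
                          FUN = function(e) cumsum(e) - e)
mod2 <- coxph(Surv(t.start, t.stop, event) ~ sex + charlson + nprev + cluster(id),
              data = readmission2)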
My second comment is about the nature of the time-dependent covariates. The AG model mostly works nicely with "external" covariates, such as air pollution: something that is not directly measured on the subject (i.e. external to the recurrent event process). This is because interpreting the first expression I wrote as the probability of no events during $(s,t)$ relies on the number of events in any given interval being Poisson distributed, which holds when the covariates are external. A discussion of this can be found in several textbooks, for example Cook & Lawless, Section 2.5. If your time-dependent covariates do depend on the recurrent event process, then they should be modeled jointly with the recurrent event process.
I start by answering question 2.
In the pec package there is a function called predictSurvProb which will do exactly what you want. This function requires a coxph object (your fit2), a new data frame (your pbc.test), and lastly the point(s) in time at which you want the survival probability. Note that I said point(s): you can also give c(1, 2, 3, 4) as the times argument to get the survival probabilities at time points 1, 2, 3 and 4. It gives the survival probability for each individual at each of the given time points.
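For example, a minimal sketch (note: as far as I know, pec's predictSurvProb method for coxph objects requires the model to have been fitted with x = TRUE, so refit fit2 with that option if you get an error):
library(pec)
# fit2: your coxph model, fitted with x = TRUE; pbc.test: your held-out data
p <- predictSurvProb(fit2, newdata = pbc.test, times = c(1, 2, 3, 4))
# p is a matrix with one row per individual in pbc.test, one column per time point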
I think question 1 has the same answer as question 2, but with the training set (pbc.train) as your new data frame. So predictSurvProb(fit2, pbc.train, 10000) should deliver the desired result. I am not entirely sure whether this function can handle time points beyond the observed follow-up.
As for your side question: if the persons are already dead, they probably do not provide any added information (though I am not certain), so I think it is safe to delete them. You can also run the code with and without them and compare the results; if they differ, I would be inclined to think the dead individuals still carry some information, in which case it would be inappropriate to delete them.
Although there is structure to your data, it may be that the baseline risk does not vary substantially enough among your subjects for a model without a frailty term to make poor predictions. Of course, it is perfectly possible that a model with a frailty term would outperform the pooled random forest model.
Even if you ran both a pooled and a hierarchical model and the pooled model did as well or slightly better, you may still want to use the hierarchical model: the variance in baseline risk among your subjects is very likely not zero, and the hierarchical model would probably perform better in the long run on data that was in neither your test nor your training sets.
As an aside, before comparing pooled and hierarchical models, consider whether the cross-validation score you are using aligns with the goals of your prediction task in the first place. If your goal is to make predictions on the same group of individuals as in your test/training data, then simple k-fold or leave-one-out cross-validation on the response is sufficient. But if you want to make predictions about new individuals, you should instead do k-fold cross-validation that samples at the individual level (see the sketch below). In the first case you score your predictions without regard for the structure of the data; in the second case you estimate your ability to predict risk for individuals who are not in your sample.
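A minimal sketch of the second scheme, where folds are formed at the level of individuals rather than rows (dat, id and k = 5 are placeholders for your data frame, its individual identifier, and the number of folds):
k    <- 5
ids  <- unique(dat$id)
fold <- sample(rep_len(1:k, length(ids)))   # a random fold label per individual
for (f in 1:k) {
  test_ids <- ids[fold == f]
  train <- dat[!(dat$id %in% test_ids), ]   # all rows of the remaining individuals
  test  <- dat[  dat$id %in% test_ids , ]   # all rows of the held-out individuals
  # ... fit the model on train, score its predictions on test ...
}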
Lastly, always remember that CV is itself data dependent, and only an estimate of your model's predictive capabilities.