R Survival Analysis – Generating Survival Data with Functional Data

functional-data-analysisproportional-hazardsrsimulationsurvival

I am reading this article here and trying to regenerate their simulation study. Here is this scenario here, among others, but if I can figure out one, the rest follow. That is,

Simulation set-up

we assume the hazard function of subject $i$ is
\begin{equation}
h_i(t|Z_i(t)) = h_0(t) \exp(\alpha Z_i(t)),
\end{equation}

where $h_0(t) = \lambda t^{\lambda-1} \exp(\eta)$, a Weibull baseline hazard function with $\lambda = 2$, $\eta = -5$, and the association parameter $\alpha = 0.5$.

Consider the linear model.

$$
Z_i(t) = a + bt + b_{i1} + b_{i2}t,
$$

The linear longitudinal trajectory is described with $a = 1$, $b = -2$. Trajectories considered above use random effect terms $b_i = (b_{i1}, b_{i2}) \sim \mathcal{N}(0,D))$, with $D = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.2 \end{pmatrix}$. For simplicity, we generate longitudinal data on irregular time points $t = 0$ and $t = j + \epsilon_{ij}$, $j = 1, 2, \ldots, 10$ and $\epsilon_{ij} \sim \mathcal{N}(0, .1^{2})$ independent across all $i$ and $j$. We simulated the censoring times from a uniform distribution in $(0, t_{\text{max}})$, with $t_{\text{max}}$ set to result in about 25% censoring.

My questions

  1. Eventually, my goal is to generate survival time (the follow-up time) and status for each subject. So, should I integrate \begin{equation}
    h_i(t|Z_i(t)) = h_0(t) \exp(\alpha Z_i(t)),
    \end{equation}
    from $0$ to $t$ to obtain the cumulative hazard $H(t)$ then obtain $S(t)=\exp(-H(t))$ and the inverse of $S(t)$, $t=S^{-1}(u)$. You then generate $U\sim\mathrm{Uniform\left(0,1\right)}$, substituting $U$ for $S\left(t\right)$ and to simulate $t$, the follow up time.

  2. Given what I have said in part (1) is correct, here I am trying to do it numerically, and obviously not working since I have negative time and am also not sure how I would be using my t as described in the simulation. It would be possible to do it analytically and work through steps to obtain the survival time.

Best Answer

The coding-specific part of this question is off-topic on this site, but there is one principle of survival analysis that you should consider implementing.

Simulating event times typically starts as you suggest, sampling from a uniform distribution over (0,1) and then finding the time corresponding to that survival fraction. The way you have this structured makes sense if you want to sample multiple event times from a distribution. In this scenario, however, you only want to take a single event-time sample from each of the randomly generated hazard functions.

Take advantage of the following relationship between $S(t)$ and the cumulative hazard $H(t)$:

$$H(t)=- \log S(t)$$

After you have your sample of the survival fraction from the uniform distribution, start to integrate your (necessarily non-negative) hazard function $h(t)$ from $t=0$ until you reach the value of $H$ corresponding to that survival fraction. Record the upper limit of integration at which that occurs as the event time. If you instead get to some maximum observation time without reaching that value of $H$, record a right-censored observation at that maximum time.

As an example, sample a survival probability and find the corresponding cumulative hazard:

set.seed(2)
(cumHazTarget <- -log(runif(1)))
## [1] 1.688036

Define your hazard function, with random effects of 0 in this instance:

hazF <- function(t) 2*t*exp(-5)*exp(0.5*(1+2*t))

Now find the value of $t$ that gives an integrated (i.e., cumulative) hazard equal to the above target. This page shows a simple way to find the upper limit of integration that gives a desired target for a definite integral. More-or-less copying that here (with a restriction of the lower limit of integration to 0) and applying it to this instance:

findprob <- function(f, interval, target) {
    optimize(function(x) {
        abs(integrate(f, 0, x)$value-target)
    }, interval)$minimum
}
findprob(hazF, interval=c(0,10),cumHazTarget)
## [1] 3.429483

That gives the time at which the cumulative hazard equals that corresponding to your sampled survival probability. (I suspect that there is a more efficient way to do this that takes advantage of the non-negativity of the hazard function, but this illustrates the principle.)

Set the top limit of interval to a value somewhat above your maximum observation time and treat times greater than the maximum observation time as right censored. For example, if your maximum observation time is 3, in this instance the function returns a time value slightly above 3, which you could then right censor at 3:

findprob(hazF, interval=c(0,3.1),cumHazTarget)
## [1] 3.099922