Survival – Simulating Models for Longitudinal and Survival Data

cox-modelfunctional-data-analysispanel datasurvival

This simulation study is taken from this [article] (https://pubmed.ncbi.nlm.nih.gov/35574725/). I am trying to generate this simulation

Theoretical Set up

Define the set of true basis functions,

\begin{align*}
\psi_{1}(t)= \sqrt{2} \cos\left(2 \pi t \right)\\
\psi_{2}(t)= \sqrt{2} \sin\left(4 \pi t \right)\\
\psi_{3}(t)= \sqrt{2} \cos\left(4 \pi t \right)
\end{align*}

such that the constraints $\lVert \psi_{k} \rVert^{2}=1$ if $k=k^{\prime}$, and $0$ otherwise, are fulfilled, $k=1,2,3$. We then independently sample the scores according to $\lambda_{i} \sim MVN(0,\Sigma)$, where $\Sigma=diag(10,6,3)$. Given the set of true basis functions and scores, the longitudinal trajectory can be formulated according to the Karhunen-Loeve expansion as,
\begin{align*}
Z_{i}(t)= \mu(t)+\lambda_{i,1}\psi_{1}(t)+\lambda_{i,2}\psi_{2}(t)+\lambda_{i,3}\psi_{3}(t)
\end{align*}

where the mean function $\mu(t)$ is assumed to be $0$. The individualized realization of the longitudinal trjectoiry $\left\{Z_{i}(t_{i,r)}, r=1,\ldots,R_{i} \right\}$ are assumed to have $\max(R_{i}) \leq 20$ for $\forall i$, constrained by censoring or event occurrence. We consider these $R_{i}$ visits to happen on a fixed time grid from $0$ to $25$, which increment of $25/\max(R_{i})$ unit.

To link covariates to the time-to-event, we assume a proportional hazard model such that the hazard function follows,
\begin{align*}
h_{i}(t)=h_{0}(t) \exp{\left\{\alpha_{1} X_{i}+\int_{0}^{\tau} \phi(t) Z_{i}(t) dt\right\} }
\end{align*}

where $\tau$ is the maximum observation time. The fixed covariate $X_{i}$ is assumed to follow a Bernoulli distribution with a success probability of $0.50$, with the corresponding coefficient $\alpha_{1}$ set to $-1$. Consider the time-varying coefficient:
\begin{align*}
\text{Scenario} : \phi(t)=0.25 \psi_{1}(t)+ 0.50 \psi_{2}(t)+ \psi_{3}(t)\\
\end{align*}

Here, we let the baseline hazard follow a Weibull distribution $h_{0}(t)= \kappa \rho (\rho t)^{\kappa-1}$ with increasing risk over time and consider $\kappa=2, \rho=0.096$. Given the above setup, the survival time $T_{i}$ can then be generated from the inverse of the cumulative hazard function $H^{−1}(u)$, where $u \sim U(0,1)$. We have assumed the independent censoring scheme in this simulation study, where $C_{i} \sim U(0, C_{max})$, with $C_{max}$ set at a value such that the $\%$ of being censored by the end of the study approximately matches our target censoring percentage.

My questions:

  1. How would I satisfy this condition: The individualized realization of the longitudinal trajectory $\left\{Z_{i}(t_{i,r)}, r=1,\ldots,R_{i} \right\}$ are assumed to have $\max(R_{i}) \leq 20$ for $\forall i$, constrained by censoring or event occurrence. We consider these $R_{i}$ visits to happen on a fixed time grid from $0$ to $25$, which increment of $25/\max(R_{i})$ unit.

  2. How would I satisfy this condition: We have assumed the independent censoring scheme in this simulation study, where $C_{i} \sim U(0, C_{max})$, with $C_{max}$ set at a value such that the $\%$ of being censored by the end of the study approximately matches our target censoring percentage (assume like 33% or 66%)

  3. So, this is related to part 1, so I need to know of $Z_{i}$'s should look like before I link them with my covariate, so is the t that I have even make sense. ( I am not sure if this question even makes sense)

I am very sorry for the long post. I have been struggling at this problem for a while now. I appreciate any help I can get. Thank you for your time, and looking forward to reading/applying your comments.

Best Answer

There are 4 simulations here for each individual: the continuous, time-varying covariate $Z_i(t)$, the time-fixed binary covariate $X_i$, the event time $T_i$ (based on $X_i$, the assumed time-varying coefficient $\phi(t)$, and $Z_i(t)$), and the censoring time $C_i$. The censoring simulation needs to be handled last, after the rest of the simulation. This usually takes some playing around to meet any specific censoring percentage.

Start with simulating $Z_i(t)$ among the individuals, as indicated in the first paragraph of "Theoretical Set up." You need to do that in continuous time for the simulation of event times, but in the final simulated data set for modeling you restrict the recorded values for each individual to the values at the specific observation times $R_t$. If $\max(R_i(t)) \le 20$, don't keep $Z_i(t)$ values in the final simulated data set after the time of the 20th point of the time grid.

As the assumed form of $Z_i(t)$ has period 1 in $t$, I at first assumed that the authors intended t=1 to represent the maximum observation time, $\tau$, the upper limit in the displayed integral. In that case, then the 25 total potential observation times would be seq(from = 0.04, to = 1, length.out=25), and the restriction to at most the first 20 observations would mean that you don't record those discretely sampled values beyond t=0.8 for $Z_i(t)$.

Reconsideration: The proposed baseline hazard function has a median survival of about 5 time units, however, so that initial assumption seems to be wrong. I'm not sure what the authors intended the maximum observation time to be. The basic idea in the prior paragraph still holds, however: you have 25 evenly spaced potential observation times for event occurrence, but you only record values of $Z_i(t)$ for the first 20.

For the integral, once you choose a set of the 3 $\lambda$ values, the integrand has a closed form amenable to numeric integration. As written, it seems that the argument to $\exp$ in the formula for $h(t)$ is fixed, simplifying the subsequent integration of $h(t)$ to get the cumulative hazard $H(t)$.

Reconsideration: The formula for $h(t)$ presented by the authors doesn't make sense for a proportional-hazards model with time-varying covariates and time-varying coefficients. In fact, the formula presented by the authors has a time-fixed covariate for each individual that depends explicitly on the last observation time. I think that they intended to write:

\begin{align*} h_{i}(t)=h_{0}(t) \exp{\left\{\alpha_{1} X_{i}+ \phi(t) Z_{i}(t) \right\} } \end{align*}

for the instantaneous hazard, which would then be integrated over time to get the cumulative hazard, up to the last observation time $\tau$.

That handles items 1 and 3, to the limits of what I can figure out. After you have chosen the random $\lambda$ values, $F_i(t)$ for individual $i$ is a continuous closed-form function, used along with the random value of the time-fixed binary covariate $X_i$ to do the integration (in one or the other of the forms discussed above) for each individual's hazard function. But for the $Z_i(t)$ values in the simulated data set, you start by only keeping the values of $Z_i(t)$ up to the 20th discrete observation time.

To simulate each individual's event time, you sample from $U(0,1)$. The authors say to find the corresponding time from the inverse of the individual's cumulative hazard function over time, but I think that they meant to sample from the corresponding survival-time distribution, where $S(t)=\exp(-H(t))$. The integration to get the cumulative hazard only needs to be done out to the last observation time. If the random sampling indicates an event time beyond that, you indicated a censored value at the last observation time.

Only then does it make sense to simulate from $U(0,C_{max})$ for censoring times. There is no way that I know to choose $C_{max}$ simply. Don't forget that some event times will be censored at the last observation time of the study, also. You try some value, find out what fraction of cases would be censored (based on $C_i < T_i$ or $T_i>$ last observation time), and if that doesn't work keep on iterating.

Once you have found a value of $C_{max}$ that gave an appropriate fraction of censored cases, omit from the final data set any data for individual $i$ for which the observation time is greater than the corresponding $C_i$.