Survival – Simulating Models for Longitudinal and Survival Data

cox-model, functional-data-analysis, panel-data, survival

This simulation study is taken from this [article](https://pubmed.ncbi.nlm.nih.gov/35574725/). I am trying to reproduce this simulation.

Theoretical Set up

Define the set of true basis functions,

\begin{align*}
\psi_{1}(t)= \sqrt{2} \cos\left(2 \pi t \right)\\
\psi_{2}(t)= \sqrt{2} \sin\left(4 \pi t \right)\\
\psi_{3}(t)= \sqrt{2} \cos\left(4 \pi t \right)
\end{align*}
such that the orthonormality constraints $\langle \psi_{k}, \psi_{k^{\prime}} \rangle = \int \psi_{k}(t)\psi_{k^{\prime}}(t)\,dt = 1$ if $k=k^{\prime}$, and $0$ otherwise, are fulfilled, $k, k^{\prime}=1,2,3$. We then independently sample the scores according to $\lambda_{i} \sim MVN(0,\Sigma)$, where $\Sigma=\operatorname{diag}(10,6,3)$. Given the set of true basis functions and scores, the longitudinal trajectory can be formulated according to the Karhunen-Loève expansion as,
\begin{align*}
Z_{i}(t)= \mu(t)+\lambda_{i,1}\psi_{1}(t)+\lambda_{i,2}\psi_{2}(t)+\lambda_{i,3}\psi_{3}(t)
\end{align*}
where the mean function $\mu(t)$ is assumed to be $0$. The individualized realizations of the longitudinal trajectory $\left\{Z_{i}(t_{i,r}), r=1,\ldots,R_{i} \right\}$ are assumed to have $\max(R_{i}) \leq 20$ for all $i$, constrained by censoring or event occurrence. We consider these $R_{i}$ visits to happen on a fixed time grid from $0$ to $25$, with increments of $25/\max(R_{i})$ units.

To link covariates to the time-to-event, we assume a proportional hazard model such that the hazard function follows,
\begin{align*}
h_{i}(t)=h_{0}(t) \exp{\left\{\alpha_{1} X_{i}+\int_{0}^{\tau} \phi(t) Z_{i}(t) dt\right\} }
\end{align*}
where $\tau$ is the maximum observation time. The fixed covariate $X_{i}$ is assumed to follow a Bernoulli distribution with a success probability of $0.50$, with the corresponding coefficient $\alpha_{1}$ set to $-1$. Consider the time-varying coefficient:
\begin{align*}
\text{Scenario} : \phi(t)=0.25 \psi_{1}(t)+ 0.50 \psi_{2}(t)+ \psi_{3}(t)
\end{align*}

Here, we let the baseline hazard follow a Weibull distribution $h_{0}(t)= \kappa \rho (\rho t)^{\kappa-1}$ with increasing risk over time and consider $\kappa=2, \rho=0.096$. Given the above setup, the survival time $T_{i}$ can then be generated from the inverse of the cumulative hazard function $H^{-1}(u)$, where $u \sim U(0,1)$. We have assumed an independent censoring scheme in this simulation study, where $C_{i} \sim U(0, C_{max})$, with $C_{max}$ set at a value such that the percentage censored by the end of the study approximately matches our target censoring percentage.

My questions:

1. How would I satisfy this condition: The individualized realizations of the longitudinal trajectory $\left\{Z_{i}(t_{i,r}), r=1,\ldots,R_{i} \right\}$ are assumed to have $\max(R_{i}) \leq 20$ for all $i$, constrained by censoring or event occurrence. We consider these $R_{i}$ visits to happen on a fixed time grid from $0$ to $25$, with increments of $25/\max(R_{i})$ units.
2. How would I satisfy this condition: We have assumed an independent censoring scheme in this simulation study, where $C_{i} \sim U(0, C_{max})$, with $C_{max}$ set at a value such that the percentage censored by the end of the study approximately matches our target censoring percentage (assume something like 33% or 66%).
3. This is related to part 1: I need to know what the $Z_{i}$'s should look like before I link them with my covariate. Does the $t$ that I have even make sense? (I am not sure if this question even makes sense.)

I am very sorry for the long post. I have been struggling with this problem for a while now. I appreciate any help I can get. Thank you for your time; I look forward to reading and applying your comments.

Best Answer

There are 4 simulations here for each individual: the continuous, time-varying covariate $Z_i(t)$, the time-fixed binary covariate $X_i$, the event time $T_i$ (based on $X_i$, the assumed time-varying coefficient $\phi(t)$, and $Z_i(t)$), and the censoring time $C_i$. The censoring simulation needs to be handled last, after the rest of the simulation. This usually takes some playing around to meet any specific censoring percentage.

Start with simulating $Z_i(t)$ among the individuals, as indicated in the first paragraph of "Theoretical Set up." You need to do that in continuous time for the simulation of event times, but in the final simulated data set for modeling you restrict the recorded values for each individual to the values at the specific observation times $t_{i,r}$, $r=1,\ldots,R_i$. Because $\max(R_i) \le 20$, don't keep $Z_i(t)$ values in the final simulated data set after the time of the 20th point of the time grid.

As the assumed form of $Z_i(t)$ has period 1 in $t$, I at first assumed that the authors intended $t=1$ to represent the maximum observation time, $\tau$, the upper limit in the displayed integral. In that case, the 25 total potential observation times would be `seq(from = 0.04, to = 1, length.out=25)`, and the restriction to at most the first 20 observations would mean that you don't record those discretely sampled values beyond $t=0.8$ for $Z_i(t)$.

Reconsideration: The proposed baseline hazard function has a median survival of roughly 9 time units, however (solving $(\rho t)^{\kappa} = \log 2$ with $\kappa=2$, $\rho=0.096$ gives $t \approx 8.7$), so that initial assumption seems to be wrong. I'm not sure what the authors intended the maximum observation time to be. The basic idea in the prior paragraph still holds, however: you have 25 evenly spaced potential observation times for event occurrence, but you only record values of $Z_i(t)$ for the first 20.
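The trajectory simulation and the "record only the first 20 of 25 grid points" rule might be sketched in R as follows. The number of individuals `n` and the value of `tau` are placeholders of mine, since the intended maximum observation time is unclear; rescale the grid once you settle on it.

```r
set.seed(1)

# True basis functions from the set-up
psi <- list(
  function(t) sqrt(2) * cos(2 * pi * t),
  function(t) sqrt(2) * sin(4 * pi * t),
  function(t) sqrt(2) * cos(4 * pi * t)
)

n <- 5                      # hypothetical number of individuals
tau <- 1                    # placeholder maximum observation time (an assumption)
score_var <- c(10, 6, 3)    # diagonal of Sigma

# Sigma is diagonal, so the three scores are independent normals;
# lambda is an n x 3 matrix, one row per individual
lambda <- sapply(score_var, function(v) rnorm(n, mean = 0, sd = sqrt(v)))

# Continuous-time trajectory for individual i (mean function is 0)
Z <- function(i, t) {
  lambda[i, 1] * psi[[1]](t) +
    lambda[i, 2] * psi[[2]](t) +
    lambda[i, 3] * psi[[3]](t)
}

# 25 evenly spaced potential observation times; keep only the first 20
grid <- seq(from = tau / 25, to = tau, length.out = 25)
obs_times <- grid[1:20]

# n x 20 matrix of recorded values, before any event or censoring truncation
Z_obs <- t(sapply(seq_len(n), function(i) Z(i, obs_times)))
dim(Z_obs)
```

The function `Z(i, t)` stays available in continuous time for the hazard calculation, while `Z_obs` holds only what would be recorded.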

For the integral, once you choose a set of the 3 $\lambda$ values, the integrand has a closed form amenable to numeric integration. As written, it seems that the argument to $\exp$ in the formula for $h(t)$ is fixed, simplifying the subsequent integration of $h(t)$ to get the cumulative hazard $H(t)$.
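As a worked illustration of that simplification, assume (my assumption, not something the authors state) that the integral runs over $[0,1]$, where the three basis functions are orthonormal. The cross terms then vanish and the integral collapses to a weighted sum of the scores. Writing $\eta_i$ for the resulting fixed linear predictor, the Weibull baseline gives a closed-form cumulative hazard that can be inverted directly:

\begin{align*}
\int_{0}^{1} \phi(t) Z_{i}(t)\,dt &= 0.25\,\lambda_{i,1} + 0.50\,\lambda_{i,2} + \lambda_{i,3} \\
\eta_{i} &= \alpha_{1} X_{i} + 0.25\,\lambda_{i,1} + 0.50\,\lambda_{i,2} + \lambda_{i,3} \\
H_{i}(t) &= \int_{0}^{t} h_{0}(s)\, e^{\eta_{i}}\, ds = (\rho t)^{\kappa} e^{\eta_{i}} \\
H_{i}(T_{i}) = -\log u \;&\Longrightarrow\; T_{i} = \frac{1}{\rho}\left[-\log(u)\, e^{-\eta_{i}}\right]^{1/\kappa}
\end{align*}

With a different integration window the first line changes, but the last two lines hold for any time-fixed $\eta_i$.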

Reconsideration: The formula for $h(t)$ presented by the authors doesn't make sense for a proportional-hazards model with time-varying covariates and time-varying coefficients. In fact, the formula presented by the authors has a time-fixed covariate for each individual that depends explicitly on the last observation time. I think that they intended to write:

\begin{align*} h_{i}(t)=h_{0}(t) \exp{\left\{\alpha_{1} X_{i}+ \phi(t) Z_{i}(t) \right\} } \end{align*}

for the instantaneous hazard, which would then be integrated over time to get the cumulative hazard, up to the last observation time $\tau$.
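A minimal sketch of that instantaneous hazard and its cumulative version for one individual is shown below. It uses a trapezoidal rule on a fine grid rather than adaptive quadrature, because the integrand oscillates; that choice, the helper names `make_hazard` and `cum_hazard_grid`, and `tau <- 25` (the end of the question's grid) are all my assumptions.

```r
kappa <- 2; rho <- 0.096; alpha1 <- -1
tau <- 25   # hypothetical maximum observation time

psi1 <- function(t) sqrt(2) * cos(2 * pi * t)
psi2 <- function(t) sqrt(2) * sin(4 * pi * t)
psi3 <- function(t) sqrt(2) * cos(4 * pi * t)
phi  <- function(t) 0.25 * psi1(t) + 0.50 * psi2(t) + psi3(t)

# Build h_i(t) = h_0(t) * exp(alpha1 * x + phi(t) * Z_i(t)) for one individual
make_hazard <- function(lambda, x) {
  Z <- function(t) lambda[1] * psi1(t) + lambda[2] * psi2(t) + lambda[3] * psi3(t)
  function(t) kappa * rho * (rho * t)^(kappa - 1) * exp(alpha1 * x + phi(t) * Z(t))
}

# Cumulative hazard on a fine grid from 0 to tau by the trapezoidal rule
cum_hazard_grid <- function(haz, tau, n_grid = 25001) {
  t <- seq(0, tau, length.out = n_grid)
  h <- haz(t)
  H <- c(0, cumsum(diff(t) * (head(h, -1) + tail(h, -1)) / 2))
  data.frame(t = t, H = H)
}

set.seed(3)
lambda_i <- rnorm(3, mean = 0, sd = sqrt(c(10, 6, 3)))  # scores for one individual
x_i <- rbinom(1, 1, 0.5)                                # binary covariate
H_i <- cum_hazard_grid(make_hazard(lambda_i, x_i), tau)
tail(H_i, 1)   # cumulative hazard at the last observation time
```

Setting all scores to zero recovers the Weibull cumulative hazard $(\rho t)^{\kappa} e^{\alpha_1 X_i}$, which is a convenient sanity check.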

That handles items 1 and 3, to the limits of what I can figure out. After you have chosen the random $\lambda$ values, $Z_i(t)$ for individual $i$ is a continuous closed-form function, used along with the random value of the time-fixed binary covariate $X_i$ to do the integration (in one or the other of the forms discussed above) for each individual's hazard function. But for the $Z_i(t)$ values in the simulated data set, you start by only keeping the values of $Z_i(t)$ up to the 20th discrete observation time.

To simulate each individual's event time, you sample from $U(0,1)$. The authors say to find the corresponding time from the inverse of the individual's cumulative hazard function over time, but I think that they meant to sample from the corresponding survival-time distribution, where $S(t)=\exp(-H(t))$. The integration to get the cumulative hazard only needs to be done out to the last observation time. If the random sampling indicates an event time beyond that, you indicate a censored value at the last observation time.
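One way to sketch this step in R is to accumulate the hazard on a grid out to the last observation time and stop where it first reaches $-\log(u)$. The helper `sim_event_time` and its arguments are hypothetical names of mine; the demonstration uses only the Weibull baseline hazard, because its inverse is known in closed form and gives something to check against.

```r
# Event time by inverting a cumulative hazard computed on a grid.
# Returns the time and an event indicator (0 = censored at tau).
sim_event_time <- function(haz, tau, u = runif(1), n_grid = 10001) {
  t <- seq(0, tau, length.out = n_grid)
  h <- haz(t)
  # Trapezoidal cumulative hazard at each grid point
  H <- c(0, cumsum(diff(t) * (head(h, -1) + tail(h, -1)) / 2))
  target <- -log(u)                      # S(T) = u  <=>  H(T) = -log(u)
  if (target > H[n_grid]) {
    return(c(time = tau, event = 0))     # no event before tau: right-censor
  }
  j <- which(H >= target)[1]             # first grid point at or past the target
  if (j == 1) return(c(time = 0, event = 1))
  # Linear interpolation between grid points j - 1 and j
  w <- (target - H[j - 1]) / (H[j] - H[j - 1])
  c(time = t[j - 1] + w * (t[j] - t[j - 1]), event = 1)
}

# Check against the Weibull baseline alone
kappa <- 2; rho <- 0.096
h0 <- function(t) kappa * rho * (rho * t)^(kappa - 1)
u <- 0.5
sim_event_time(h0, tau = 25, u = u)
(-log(u))^(1 / kappa) / rho              # closed-form inverse for comparison
```

For the full model, pass the individual's hazard function in place of `h0`; a finer grid may be needed when the hazard oscillates strongly.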

Only then does it make sense to simulate from $U(0,C_{max})$ for censoring times. There is no way that I know to choose $C_{max}$ simply. Don't forget that some event times will be censored at the last observation time of the study, also. You try some value, find out what fraction of cases would be censored (based on $C_i < T_i$ or $T_i>$ last observation time), and if that doesn't work keep on iterating.
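The trial-and-error search can be automated. The sketch below fixes one set of uniform draws, so that the censored fraction can only decrease as $C_{max}$ grows, and then runs a root search. The event times here are a stand-in of mine (Weibull baseline with only the binary covariate); substitute the event times from your own simulation. `tau`, `n`, and the 33% target are likewise illustrative.

```r
set.seed(4)
n <- 5000
kappa <- 2; rho <- 0.096; alpha1 <- -1
tau <- 25   # hypothetical last observation time

# Stand-in event times (assumption): Weibull baseline, binary covariate only
x <- rbinom(n, 1, 0.5)
T_event <- (-log(runif(n)) / exp(alpha1 * x))^(1 / kappa) / rho

# Draw the uniforms once; C_i = v_i * C_max is then U(0, C_max) for any C_max
v <- runif(n)
cens_frac <- function(C_max) {
  C <- v * C_max
  mean(C < T_event | T_event > tau)   # censored by C_i or by the end of study
}

target <- 0.33
# The fraction is non-increasing in C_max, so the sign change brackets a solution
C_max <- uniroot(function(cm) cens_frac(cm) - target, interval = c(1, 1000))$root
c(C_max = C_max, censored = cens_frac(C_max))
```

Because administrative censoring at `tau` sets a floor on the censored fraction, very low targets may be unreachable no matter how large `C_max` is.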

Once you have found a value of $C_{max}$ that gave an appropriate fraction of censored cases, omit from the final data set any data for individual $i$ for which the observation time is greater than the corresponding $C_i$.
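Assembling the final long-format data might then look like the sketch below. All numbers are made up for illustration: the event and censoring times are hypothetical, and `tau <- 1` follows the placeholder time scale rather than anything the authors specify.

```r
set.seed(5)
n <- 4
tau <- 1                                           # placeholder time scale
grid <- seq(tau / 25, tau, length.out = 25)[1:20]  # first 20 of 25 grid points

# Hypothetical event and censoring times, for illustration only
T_event <- c(0.13, 1.20, 0.50, 0.33)
C       <- c(0.40, 0.70, 0.26, 1.60)
obs_time <- pmin(T_event, C, tau)                  # observed follow-up time
status   <- as.integer(T_event <= pmin(C, tau))    # 1 = event, 0 = censored

# Scores and trajectories as in the set-up
scores <- sapply(c(10, 6, 3), function(v) rnorm(n, 0, sqrt(v)))
Z <- function(i, t) {
  scores[i, 1] * sqrt(2) * cos(2 * pi * t) +
    scores[i, 2] * sqrt(2) * sin(4 * pi * t) +
    scores[i, 3] * sqrt(2) * cos(4 * pi * t)
}

# One row per recorded visit; drop visits after the observed follow-up time
long <- do.call(rbind, lapply(seq_len(n), function(i) {
  keep <- grid[grid <= obs_time[i]]
  if (length(keep) == 0) return(NULL)
  data.frame(id = i, visit_time = keep, Z = Z(i, keep),
             obs_time = obs_time[i], status = status[i])
}))
table(long$id)   # number of retained visits R_i per individual
```

Each individual ends up with $R_i \le 20$ rows, with $R_i$ determined by whichever of the event, censoring, or grid limit comes first.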

Related Solutions

Survival – Correct Treatment of Censorship and Observation Periods in Longitudinal Survival Analysis

There's seldom anything to be gained by throwing away information. You do have to include it correctly in your analysis, however. Sometimes, as described below, censoring is not the correct choice.

I still refer frequently to the Leung et al review on censoring, even though it's a quarter-century old. Read it for more insight on what follows.

Scenario 1

Scenario 1 is what's described as "Type I censoring" by Leung et al:

a study in which every subject is under observation for a specified period $C_0$ or until failure.

So I wouldn't call those "lost-to-follow-up." It's just an accepted type of study design with censoring. Yes, count them as right-censored as of their last follow up within your time window.
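As a small illustration of Type I censoring, the following sketch right-censors everyone who has not failed by a fixed observation period. The exponential event times and the value of `C0` are hypothetical.

```r
set.seed(6)
C0 <- 5                               # hypothetical fixed observation period
T_true <- rexp(200, rate = 0.2)       # hypothetical latent event times

time   <- pmin(T_true, C0)            # follow-up ends at failure or at C0
status <- as.integer(T_true <= C0)    # 1 = event observed, 0 = right-censored at C0
table(status)
```

The pair `(time, status)` is what a standard survival routine expects; nobody is dropped from the data.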

Scenarios 2 and 3

There is a big risk here of informative censoring if your event is "remission"* while some individuals die (Scenario 2) or might leave the study due to illness (e.g., in Scenario 3). At the least, this would require a competing-risks analysis if each event is absorbing (no return to the study after the event). As the vignette mentioned by Frank Harrell in a comment says in Section 2.2 (page 8):

A common mistake with competing risks is to use the Kaplan-Meier separately on each event type while treating other event types as censored.

So for Scenario 2 you certainly can't remove those who died, and you can't censor on death dates. Death needs to be treated as a competing event.

You have to apply your understanding of the subject matter with respect to how to treat Scenario 3. Is censoring under that Scenario informative about future event times?

Scenario 4

If individuals with less than 5 years of follow up were still enrolled and available in principle for data collection at the time that you chose to end the study, they can still provide information up through their last observation date. As you describe it, that censoring wouldn't seem to be informative. Those individuals could be considered as having right-censored event times as of their last observation. That they didn't get all the way to 5 years doesn't matter; that's an advantage of survival analysis that appropriately incorporates information about censoring times.

*If the event is "remission" you really have a type of cure model, as that event (unlike death) presumably doesn't eventually happen to all.

R Survival Analysis – Generating Survival Data with Functional Data

The coding-specific part of this question is off-topic on this site, but there is one principle of survival analysis that you should consider implementing.

Simulating event times typically starts as you suggest, sampling from a uniform distribution over (0,1) and then finding the time corresponding to that survival fraction. The way you have this structured makes sense if you want to sample multiple event times from a distribution. In this scenario, however, you only want to take a single event-time sample from each of the randomly generated hazard functions.

Take advantage of the following relationship between $S(t)$ and the cumulative hazard $H(t)$:

$$H(t)=- \log S(t)$$

After you have your sample of the survival fraction from the uniform distribution, start to integrate your (necessarily non-negative) hazard function $h(t)$ from $t=0$ until you reach the value of $H$ corresponding to that survival fraction. Record the upper limit of integration at which that occurs as the event time. If you instead get to some maximum observation time without reaching that value of $H$, record a right-censored observation at that maximum time.

As an example, sample a survival probability and find the corresponding cumulative hazard:

set.seed(2)
(cumHazTarget <- -log(runif(1)))
## [1] 1.688036

Define your hazard function, with random effects of 0 in this instance:

hazF <- function(t) 2*t*exp(-5)*exp(0.5*(1+2*t))

Now find the value of $t$ that gives an integrated (i.e., cumulative) hazard equal to the above target. This page shows a simple way to find the upper limit of integration that gives a desired target for a definite integral. More-or-less copying that here (with a restriction of the lower limit of integration to 0) and applying it to this instance:

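# Find the upper limit x within `interval` at which the integral of f
# from 0 to x is closest to `target`
findprob <- function(f, interval, target) {
    optimize(function(x) {
        abs(integrate(f, 0, x)$value-target)
    }, interval)$minimum
}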
findprob(hazF, interval=c(0,10),cumHazTarget)
## [1] 3.429483

That gives the time at which the cumulative hazard equals that corresponding to your sampled survival probability. (I suspect that there is a more efficient way to do this that takes advantage of the non-negativity of the hazard function, but this illustrates the principle.)
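One candidate for such a more efficient approach: because the hazard is non-negative, the cumulative hazard is non-decreasing, so a bracketing root finder such as `uniroot()` can solve $H(t) = \text{target}$ directly instead of minimizing an absolute difference. This is a sketch under that assumption; `cumHaz` and `find_time` are names I made up, and returning `NA` for "no event by `t_max`" is just one possible convention.

```r
hazF <- function(t) 2 * t * exp(-5) * exp(0.5 * (1 + 2 * t))

set.seed(2)
cumHazTarget <- -log(runif(1))

# Cumulative hazard from 0 to x (defined as 0 at x = 0)
cumHaz <- function(x) if (x <= 0) 0 else integrate(hazF, 0, x)$value

# Solve cumHaz(t) = target on [0, t_max]; NA means no event by t_max,
# which would be recorded as right-censored at t_max
find_time <- function(target, t_max) {
  if (cumHaz(t_max) < target) return(NA_real_)
  uniroot(function(x) cumHaz(x) - target,
          interval = c(0, t_max), tol = 1e-8)$root
}

find_time(cumHazTarget, t_max = 10)   # should land close to the optimize() result
find_time(cumHazTarget, t_max = 3)    # NA: censor at the maximum observation time
```

Checking `cumHaz(t_max)` first also removes the need to pad the search interval beyond the maximum observation time.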

Set the top limit of interval to a value somewhat above your maximum observation time and treat times greater than the maximum observation time as right censored. For example, if your maximum observation time is 3, in this instance the function returns a time value slightly above 3, which you could then right censor at 3:

findprob(hazF, interval=c(0,3.1),cumHazTarget)
## [1] 3.099922
