Solved – Simulate time-to-event data based on an existing subset of data

r, simulation, survival

I have the task of predicting the data volume to be processed for a study that currently has 40 patients but will eventually have 100. Since most of the data is generated during the treatment phase, I would like to focus on this phase.

For these 40 patients, I calculated the time to end of treatment (censoring patients still on treatment). I thought that, using this survival curve, I would be able to 'draw' another 60 patients from the same curve. If I had the survival times (e.g. time to end of treatment) for all 100 patients, I could add up all the time-to-end-of-treatment times to get an idea of the total number of months of treatment in our sample. To do this, I would assume, for example, 5 patients accrued per month, then add the simulated survival time for each patient, take a close-out date, and sum the 'treatment time' across all patients.
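A minimal sketch of the bookkeeping described above, in R; the accrual rate, close-out month, and the exponentially distributed treatment durations are placeholders for illustration, not estimates from any data:

```r
set.seed(1)
n <- 100
accrual_month <- rep(seq_len(n / 5), each = 5)  # 5 patients accrued per month
trt_time <- rexp(n, rate = 1 / 12)              # placeholder simulated treatment durations (months)
close_out <- 24                                 # hypothetical close-out month
# treatment time observed per patient, truncated at the close-out date
observed <- pmax(pmin(accrual_month + trt_time, close_out) - accrual_month, 0)
total_treatment_months <- sum(observed)
```

Once the simulated survival times come from a distribution fitted to the real data (rather than the placeholder `rexp` call), `total_treatment_months` is the quantity of interest.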

I have googled everywhere and found all kinds of packages that can simulate the most complex survival data; however, I fail to find/understand how to simulate 'simple' survival data.

Does anyone know how I can simulate X survival times using existing data?
Surely I can do it roughly by hand, as there are many knowns (e.g. estimated survival time for various proportions of the data), but surely there is an easier method?

The lung dataset might provide a good example…

library(survival)
sfit <- survfit(Surv(time, status) ~ 1, data = lung)
plot(sfit)

This dataset contains 228 patients. How would you simulate data for 1000 patients with characteristics similar to those of the first 228 (e.g. a median survival time of 310 days, etc.)?

EDIT: I guess I just need to be able to sample from the Weibull (or exponential) distribution. The question is how to estimate the Weibull parameters from the data I already have, in a way that incorporates censored data….
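For reference, this can be done with the `survival` package alone: `survreg` fits a parametric Weibull model and handles the censoring through the `Surv` object. A sketch on the lung data; note that `survreg` uses a log-linear location/scale parameterisation, so its estimates must be converted before they can be passed to `rweibull`:

```r
library(survival)
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")
# convert survreg's location/scale parameterisation to rweibull's shape/scale
wshape <- 1 / fit$scale
wscale <- exp(coef(fit))
sim_times <- rweibull(1000, shape = wshape, scale = wscale)  # 1000 simulated survival times
```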

Best Answer

You should first decide on a survival-time distribution that best fits the data you already have. This can be done by fitting several parametric distributions (e.g. exponential, Weibull, log-normal, etc.) to the data from the 40 patients and comparing the AIC of the models; the R package 'flexsurv' is useful here, as it reports the AIC as part of its return value.

Below is a mock-up of how to choose between, say, a log-normal and a Weibull model for the lung cancer data.

library(flexsurv)
surv <- with(lung, Surv(time, status))          # create the survival object
model <- flexsurvreg(surv ~ 1, dist = "lnorm")  # fit the log-normal model
model
Call:
flexsurvreg(formula = surv ~ 1, dist = "lnorm")
Estimates: 
         est    L95%   U95% 
meanlog  5.660  5.510  5.820
sdlog    1.100  0.983  1.230

N = 228,  Events: 165,  Censored: 63
Total time at risk: 69593
Log-likelihood = -1169.269, df = 2
AIC = 2342.538

Let's take a visual look at the fit:

plot(model, ylab = "Survival probability", xlab = "Time")
legend("topright", legend = c("KM Plot", "Fitted"), lty = c(1, 1), col = c("black", "red"), cex = 0.5)

(Plot: log-normal fit overlaid on the Kaplan–Meier curve)

Now let's try the Weibull model:

model2 <- flexsurvreg(surv ~ 1, dist = "weibull")  # fit the Weibull model
model2
Call:
flexsurvreg(formula = surv ~ 1, dist = "weibull")
Estimates:
          est     L95%    U95% 
shape    1.32    1.17    1.49
scale  418.00  372.00  469.00

N = 228,  Events: 165,  Censored: 63
Total time at risk: 69593
Log-likelihood = -1153.851, df = 2
AIC = 2311.702

See how it fits below

plot(model2, ylab = "Survival probability", xlab = "Time")
legend("topright", legend = c("KM Plot", "Fitted"), lty = c(1, 1), col = c("black", "red"), cex = 0.5)

(Plot: Weibull fit overlaid on the Kaplan–Meier curve)

Model 2 (the Weibull distribution) fits better: it has a lower AIC. This is also supported by the graph, which shows smaller deviations between the fitted and observed survival probabilities over time for the Weibull distribution. You can then generate your random samples from the best-fitting distribution. I have used the Weibull here, but that might not be the case with your data.

n <- 100  # number of samples to draw
wshape <- model2$res["shape", "est"]  # Weibull shape on the natural scale
wscale <- model2$res["scale", "est"]  # Weibull scale on the natural scale
T <- rweibull(n, shape = wshape, scale = wscale)  # time to event
C <- rweibull(n, shape = wshape, scale = 450)  # random censoring times; scale set above the value estimated from the sample data
(sum(T > C) / n) * 100  # censoring rate (%); adjust the scale parameter for 'C' until this is as close as possible to the observed censoring rate in the sample data

T_new <- pmin(T, C)     # the final time to use as your sampled time
d <- as.integer(T < C)  # event indicator: the event was observed if d == 1

Now you have the time and the event indicator; from your description you actually don't need the latter, since your interest is in the total of T_new. You will want to repeat drawing samples of 100 several times, say 1000, and then take the average of the total of T_new across the replications.
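A sketch of that replication step; the shape and scale values below are the Weibull estimates printed above, and in practice you would plug in your own fitted values:

```r
set.seed(42)
n <- 100     # patients per simulated study
reps <- 1000 # number of replications
totals <- replicate(reps, {
  T <- rweibull(n, shape = 1.32, scale = 418)  # simulated times to event
  C <- rweibull(n, shape = 1.32, scale = 450)  # simulated censoring times
  sum(pmin(T, C))                              # total observed time for this replication
})
mean(totals)  # average total time across replications
```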

I hope this was helpful.
