How can I generate survival data according to the specified models?

# R – How to Simulate Survival Data

r

#### Related Solutions

Note: the simulated data using simulate.lme does not match elements of the original data structure or model fit (eg. variance, effect size...) nor does it creation of data de novo for experimental design testing.

```
require(nlme)
?nlme::simulate.lme
fit <- lme(distance ~ age + Sex, data = Orthodont, random = ~ 1)
orthSim <- simulate.lme(fit, nsim = 1)
```

This produces a simulated fitting (with a possible alternative model).

This is thanks to an answer by @Momo on one of my questions: Is there a general method for simulating data from a formula or analysis available?

If you require the simulated data, you will need to create a new function from the simulate.lme function.

```
simulate.lme.data<-edit(simulate.lme)
```

add the following line right before the last bracket

```
return(base2)
```

You can then create as much data as you want:

```
orthSimdata <- simulate.lme.data(fit, nsim = 1)
```

Note this is from my (possibly mis-)interpretation of the un-commented code in simulate.lme.

Though this is useful, this seems to do little less than add gaussian noise to your existing data.

This can not be used to directly simulate data de novo. I currently create the start data by adding the numeric value of the factors levels of my experimental design data frame (eg. response=as.numeric(factor1)+as.numeric(factor2)+as.numeric(factor1)*as.numeric(factor1)+rnorm(sd=2)...).

First: you can sample directly from any survival function, $S(t)$ which shows the time-dependent probability of living to that time or longer. The way to do this is by generating uniform RVs $u$ as quantiles and finding $S^{-1}(u)$. This can be done analytically, or below I have an example of how to do it numerically with a pseudocontinuous or discrete time using `colSums(outer(x,y,'<'))`

which beats `quantile`

by many flops.

Second: the survival function is related to the hazard via: $S(t) = exp(-\Lambda(t))$ where $\Lambda(t) = \int_{0}^t \lambda(s) ds$ is called the cumulative hazard function.

So for simplicity let's sample just from the baseline hazard function, omitting any influence of covariates. As a note, the influence of covariates can be added back by generating survival curves for *each* individual in the sample by multiplying the hazard function by their exponentiated linear predictor. The cumulative hazard *could* be found analytically, but a numerical approach with a range of possible failure times is given by:

```
tdom <- seq(0, 5, by=0.01)
haz <- rep(0, length(tdom))
haz[tdom <= 1] <- exp(-0.3*tdom[tdom <= 1])
haz[tdom > 1 & tdom <= 2.5] <- exp(-0.3)
haz[tdom > 2.5] <- exp(0.3*(tdom[tdom > 2.5] - 3.5))
cumhaz <- cumsum(haz*0.01)
Surv <- exp(-cumhaz)
par(mfrow=c(3,1))
plot(tdom, haz, type='l', xlab='Time domain', ylab='Hazard')
plot(tdom, cumhaz, type='l', xlab='Time domain', ylab='Cumulative hazard')
plot(tdom, Surv, type='l', xlab='Time domain', ylab='Survival)
# generate 100 random samples:
u <- runif(100)
failtimes <- tdom[colSums(outer(Surv, u, `>`))]
dev.off()
library(survival)
plot(survfit(Surv(failtimes)~1))
```

Gives:

## Best Answer

Generating $\tilde{T}$ is straightforward from the model. I assume the difficulty lies in sampling $C$ from a Cox model, and you wouldn't be alone in this confusion.

The general methodology used to generate data from a Cox model is inverse CDF sampling (the inverse operation of probability integral transform). The sampling scheme goes as follows:

I'll write the expression inside the exponential function as $h(X,A)$ for convenience. Then the Cox model has the hazard function $\lambda_C(t\mid X,A) = \lambda_0(t)\exp(h(X,A))$, which gives the survival function $S_C(t\mid X,A) = \exp\{-\Lambda_0(t)\exp(h(X,A)) \}$. The CDF and survival function are related to each other: $F_C(t\mid X,A) = 1-S_C(t\mid X,A)$.

Depending on the form of the cumulative hazard $\Lambda_0(t)$, the inverse of $F_C(t\mid X,A)$ may not be available in closed form. If $\Lambda_0^{-1}(x)$ is available, then $F_C^{-1}(u\mid X,A) = \Lambda^{-1}\left[-\log(1-u)\exp(h(X,A)) \right]$. If not, you may want to numerically solve the equation $F_C(t\mid X,A)-u=0$ for $t$ given a uniform variable $u$.