Solved – Simulation Analysis of a Cox Survival Model with Change Point.

computational-statisticscox-modelrregressionsurvival

I wish to simulate survival data for the following probabilistic model which will be analyzed using a Cox model:

An exposure $X$ is modeled as binary having $X \sim_{iid} \mbox{Bernoulli}(p)$ where $p = \frac{1}{3}$. The hazard is a function of $X$ with:

\begin{eqnarray}
\log \left( \lambda(t|X=0) \right) &=& \beta_0\\
\log \left( \lambda(t|X=1) \right) &=& \beta_0 + \Delta\cdot\mathcal{I}(t \geq \tau)
\end{eqnarray}

I also wish to generate a censoring variable having an independent exponential distribution.

Best Answer

Constant hazards mean that survival times are exponentially distributed. It's possible, in the case of $X=1$ to simulate first whether your exponential time is greater than $\tau$ and then generate what that time will be according to the correct distribution (and, of course, memorylessness).

For an exponential RV $T \sim \mbox{Exp}(\beta_0)$ the probability that $T$ is greater than $\tau$ is just $\exp(-\theta\tau)$. If you compute the indicator of whether this exceeds that value, you can generate the actual value of $T$ using that $Pr(t \leq T | t \geq \tau) = 1-\exp( \left(t-\tau \right)\left(\theta + \Delta \right) )$

In R it looks like this:

n <- 100 ## set sample size
x <- rbinom(n, 1, 1/3) ## generate exposure
tau <- 1 ## set cut off for rate change point
theta <- 0.8 ## baseline rate
delta <- 0.3 ## rate difference after change point
eta <- 2 ## censoring rate
t <- numeric(n)  ## init failure times
t[x==0] <- rexp(sum(x==0), theta) ## generate unconditional failure times for unexposed
i <- rbinom(sum(x==1), 1, exp(-tau*theta)) ## generate indicator for change point
t[x==1] <- rexp(sum(x==1), theta + delta*i) + i*tau ## generate conditional failure times
r <- rexp(n, eta)
eventtm <- ifelse(t < r, t, r)
outcome <- ifelse(t < r, 1, 0)

coxph(Surv(eventtm, outcome) ~ x)

Related Solutions

Solved – Change-point in Cox survival model

You need to make days_ind a time-dependent variable. The way you have it coded right now, everybody whose observation (whether event or censoring time) was after 110 days will have experienced a different hazard throughout their entire followup then those whose observation is before 110 days. What you want to have is for the hazard to "jump" at 110 days.

It is not completely straightforward to set up this analysis in the survival package. You have to split the follow-up period of each person into two periods: up to 110 days and after that. Anybody surviving beyond 110 days would have two observations: one right-censored at 110, and the other left-truncated at 110 and having the actual event on the right side. Fortunately, there is a function to do exactly that: survSplit.

Here is a quick example with a built-in dataset:

> library(survival)
> aml$id <- 1:nrow(aml)  # add a subject ID variable
> aml2 <- survSplit(aml,cut=10,end="time",start="start", event="status", episode="period")
> 
> subset(aml, id<=3)
  time status          x id
1    9      1 Maintained  1
2   13      1 Maintained  2
3   13      0 Maintained  3
> subset(aml2, id<=3)
   time status          x id start period
1     9      1 Maintained  1     0      0
2    10      0 Maintained  2     0      0
3    10      0 Maintained  3     0      0
25   13      1 Maintained  2    10      1
26   13      0 Maintained  3    10      1

You can see that there are now two observations for id's 2 and 3. The period variable corresponds to your days_ind.

From here you can build the model you want, but you have to code the effects carefully, because the effect of period cannot be estimated, since this it refers to different times.

> fit <- coxph(Surv(start, time, status) ~ 
   I((x=="Maintained")&(period==0)) + I((x=="Maintained")&(period==1)), data=aml2)
> fit
Call:
coxph(formula = Surv(start, time, status) ~ I((x == "Maintained") & 
    (period == 0)) + I((x == "Maintained") & (period == 1)), 
    data = aml2)


                                             coef exp(coef) se(coef)     z    p
I((x == "Maintained") & (period == 0))TRUE -1.498     0.224    1.120 -1.34 0.18
I((x == "Maintained") & (period == 1))TRUE -0.722     0.486    0.591 -1.22 0.22

Likelihood ratio test=3.79  on 2 df, p=0.150  n= 41, number of events= 18

Here the two coefficients measure the effect of Maintained vs Non-maintained before 10 days and after 10 days, respectively.

You could also consider using the cmprsk package. It is designed for analysis of competing risks, but there is nothing stopping you from using it for only one outcome. The benefit is that it has an easier way of defining time-dependent covariates (though a really awkward syntax overall):

> library(cmprsk)
> fit1 <- with(aml, crr(time, status, cov1=I(x=="Maintained"), cov2=I(x=="Maintained"), 
+                      tf=function(t)I(t<=10)))
> fit1
convergence:  TRUE 
coefficients:
    I(x == "Maintained")1 I(x == "Maintained")1*tf1 
                  -0.7213                   -0.7387 
standard errors:
[1] 0.5259 1.1500
two-sided p-values:
    I(x == "Maintained")1 I(x == "Maintained")1*tf1 
                     0.17                      0.52

Note that with the different coding, the meaning of the coefficients is not exactly the same as above.

Solved – How to simulate survival times using true base line hazard function

First: you can sample directly from any survival function, $S(t)$ which shows the time-dependent probability of living to that time or longer. The way to do this is by generating uniform RVs $u$ as quantiles and finding $S^{-1}(u)$. This can be done analytically, or below I have an example of how to do it numerically with a pseudocontinuous or discrete time using colSums(outer(x,y,'<')) which beats quantile by many flops.

Second: the survival function is related to the hazard via: $S(t) = exp(-\Lambda(t))$ where $\Lambda(t) = \int_{0}^t \lambda(s) ds$ is called the cumulative hazard function.

So for simplicity let's sample just from the baseline hazard function, omitting any influence of covariates. As a note, the influence of covariates can be added back by generating survival curves for each individual in the sample by multiplying the hazard function by their exponentiated linear predictor. The cumulative hazard could be found analytically, but a numerical approach with a range of possible failure times is given by:

tdom <- seq(0, 5, by=0.01)
haz <- rep(0, length(tdom))
haz[tdom <= 1] <- exp(-0.3*tdom[tdom <= 1])
haz[tdom > 1 & tdom <= 2.5] <- exp(-0.3)
haz[tdom > 2.5] <- exp(0.3*(tdom[tdom > 2.5] - 3.5))
cumhaz <- cumsum(haz*0.01)
Surv <- exp(-cumhaz)
par(mfrow=c(3,1))
plot(tdom, haz, type='l', xlab='Time domain', ylab='Hazard')
plot(tdom, cumhaz, type='l', xlab='Time domain', ylab='Cumulative hazard')
plot(tdom, Surv, type='l', xlab='Time domain', ylab='Survival)

# generate 100 random samples:
u <- runif(100)
failtimes <- tdom[colSums(outer(Surv, u, `>`))]

dev.off()
library(survival)
plot(survfit(Surv(failtimes)~1))

Gives:

Best Answer

Related Solutions

Solved – Change-point in Cox survival model

Solved – How to simulate survival times using true base line hazard function

Related Question