Survival – Simulating Survival Times and Accounting for Population Mean Linear Predictor

cox-model, kaplan-meier, simulation, survival

I have been using the methods outlined in: Bender, Ralf, Thomas Augustin, and Maria Blettner, "Generating survival times to simulate Cox proportional hazards models," Statist. Med. 24: 1713–1723 (2005) to generate survival times from a dataset.

After building the Cox model, I specify a 'baseline patient' and use survfit to estimate the cumulative baseline hazard function:

tmpdf <- data.frame(
  age = mean(pdata$age),
  o2_listing = 0,
  group = "A",
  func_status = median(pdata$func_status),
  bmi_listing = mean(pdata$bmi),
  pa_mean = mean(pdata$pa_mean),
  reg_height = mean(pdata$reg_height),
  pcw = mean(pdata$pcw),
  ventilator = "N",
  pros_inhale = 0,
  egfr = mean(pdata$egfr),
  gender = "F"
)

rc <- cph()  # Model excluded for this post

sv <- survfit(rc, newdata = tmpdf)

I then use nls to estimate the shape (v) and scale (l) parameters of a Weibull fit to the cumulative baseline hazard function:

c_haz <- sv$cumhaz
sv_t <- sv$time

fit <- nls(c_haz ~ l * sv_t^v, algorithm="port", start=list(l=1, v=1))
l <- coef(fit)[1]
v <- coef(fit)[2]
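One possible way to make this fit less sensitive to the starting values, sketched on made-up data rather than on the sv object above: because l * t^v is linear on the log-log scale, a quick lm() of log cumulative hazard on log time gives rough estimates that can be handed to nls(). The "true" values 0.02 and 1.3, the time grid, and the noise level are all assumptions for illustration. With real survfit output, any times with a cumulative hazard of 0 would need to be dropped before taking logs.

```r
set.seed(42)

# Hypothetical cumulative hazard following l * t^v, with small multiplicative noise
t_grid <- seq(1, 100, by = 2)
true_l <- 0.02
true_v <- 1.3
c_haz_demo <- true_l * t_grid^true_v * exp(rnorm(length(t_grid), sd = 0.02))

# log(H) = log(l) + v * log(t), so a linear fit gives rough starting values
ll_fit <- lm(log(c_haz_demo) ~ log(t_grid))
start_vals <- list(l = exp(unname(coef(ll_fit)[1])),
                   v = unname(coef(ll_fit)[2]))

# Refine on the original scale, mirroring the nls() call above
fit_demo <- nls(c_haz_demo ~ l * t_grid^v, start = start_vals)
coef(fit_demo)
```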

I can then use the techniques from Bender et al. (2005) to generate random survival times for each patient in the dataset. The calculation involves each patient's linear predictor (sometimes referred to as xb or xbeta in the code).
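For reference, the Weibull case of that inversion can be sketched as follows, using the same parameterisation as the nls() fit (cumulative baseline hazard l * t^v). The values of l, v and xb below are invented for illustration and are not taken from my data. Note that xb has to be expressed relative to whatever covariate pattern the baseline hazard was estimated for.

```r
set.seed(1)

# Hypothetical Weibull baseline parameters: H0(t) = l * t^v
l <- 0.02
v <- 1.3

# Hypothetical linear predictors, one per simulated patient
xb <- rnorm(1000, sd = 0.5)

# One uniform draw per patient
u <- runif(length(xb))

# Solve S(t | xb) = exp(-l * t^v * exp(xb)) = u for t
sim_time <- (-log(u) / (l * exp(xb)))^(1 / v)

summary(sim_time)
```

Plugging sim_time back into the survival function should return the uniform draws, which is a quick sanity check on the algebra.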

I can generate survival times by plugging each patient's xbeta into the formula. However, when I compare the real survival curve (black) with the simulated curve (blue), the simulation overestimates survival.

I can also calculate the xbeta for the reference patient:

reference_xb <- unname(predict(rc, newdata=tmpdf, type="lp"))

I can then subtract this reference_xb from each patient's xbeta, but this underestimates survival.

However, if I take each patient's xbeta, subtract the reference patient's reference_xb, and also subtract the population mean xbeta, the simulated curve fits very well.

I have two main questions:

1 – Is my reasoning/methodology for generating survival times correct? (i.e. will the survival times generated from simulation be representative of the population being simulated?)

2 – I understand why it is necessary to subtract the reference patient's xbeta, but why is it necessary to also subtract the mean xbeta of the population?

Best Answer

I suspect that your problem has to do with categorical predictors coded as numeric, perhaps as 0/1. You specify o2_listing and pros_inhale to have numeric values of 0 (not character "0") for your "baseline patient." If those predictors are coded as numeric as implied, then the values assumed by coxph() and survfit() as their references for linear-predictor calculations and for the baseline survival curve will not be 0, but instead the numeric mean of each. You then need to take the population mean linear predictor into account.

One of the first examples in the main survival vignette can illustrate this problem.

library(survival)
cfit1 <- coxph(Surv(time, status) ~ age + sex + wt.loss, data=lung)

In that data set, sex is coded as 1/2 for male/female. The baseline survival curve and the reference for linear-predictor calculations are at a sex value of 1.395, clearly non-binary.
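A quick way to see the centering for yourself, sketched with the same lung model: the reference values are stored in the fit's means component, and the linear predictors returned by predict() are the raw X %*% beta values minus the linear predictor at those reference values. My understanding is that recent releases of survival also have a nocenter argument to coxph() that leaves columns taking only the values -1, 0 or 1 uncentered, so what ends up in means can depend on the package version; treat that as something to check in your installed version.

```r
library(survival)

cfit1 <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)

# Reference covariate values used for centering
cfit1$means

# Raw (uncentered) linear predictor for the rows used in the fit
X <- model.matrix(cfit1)
lp_raw <- drop(X %*% coef(cfit1))

# Subtract the linear predictor evaluated at the reference values
lp_centered <- lp_raw - sum(cfit1$means * coef(cfit1))

# Should agree with what predict() reports
all.equal(unname(lp_centered), unname(predict(cfit1, type = "lp")))
```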

A warning: even then, there's no assurance that you will match the Kaplan-Meier curve you display for "real_pop" with a curve estimated from mean covariate values. Although the linear predictors are linear functions of covariate values, the individual survival curves are doubly nonlinear from there. The survival at time $t$ for a given linear predictor value $\text{LP}$ is $S(t|\text{LP})= S_0(t)^{\exp(\text{LP})}$, where $S_0(t)$ is the baseline survival function. Insofar as the Kaplan-Meier curve can be thought of as the average survival curve over the population, there's no assurance that the mean of those survival curves will coincide with the survival curve estimated from mean covariate values.
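A toy calculation makes the point. Assume, purely for illustration, a baseline survival of 0.8 at some fixed time and two individuals with linear predictors of -1 and +1, so that the mean linear predictor is 0:

```r
# Hypothetical baseline survival at one fixed time point
S0 <- 0.8

# Hypothetical linear predictors for two individuals (mean is 0)
lp <- c(-1, 1)

# Average of the individual survival probabilities
mean_of_curves <- mean(S0^exp(lp))

# Survival probability at the average linear predictor
curve_at_mean <- S0^exp(mean(lp))

c(mean_of_curves = mean_of_curves, curve_at_mean = curve_at_mean)
```

The two numbers differ even though the linear predictors average to exactly the reference value, because $S_0(t)^{\exp(\text{LP})}$ is not linear in $\text{LP}$.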

It seems that you are doing something more sophisticated than that. As best as I can tell, it seems that you are getting a large number of simulations based on each individual's covariate values to get a survival curve for that individual, then averaging over all the individuals. That should be fine, if perhaps more work than is necessary. As you are fitting the overall model to a Weibull model, you could do that by just calculating the Weibull survival function for each individual without any random sampling. For things like estimating sample-size requirements for complex prospective studies, however, you will need to do random sampling at appropriate overall sample sizes.
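As a minimal sketch of the "no random sampling" route, assuming a Weibull baseline with cumulative hazard l * t^v: compute each individual's survival function on a time grid and average across individuals. The parameter values, the linear predictors and the time grid here are hypothetical stand-ins for your fitted quantities.

```r
set.seed(2)

# Hypothetical Weibull baseline parameters: H0(t) = l * t^v
l <- 0.02
v <- 1.3

# Hypothetical linear predictors, relative to the baseline covariate pattern
xb <- rnorm(500, sd = 0.5)

# Time grid on which to evaluate the curves
t_grid <- seq(0, 60, by = 1)

# Rows are times, columns are individuals: S_i(t) = exp(-H0(t) * exp(xb_i))
S_ind <- exp(-outer(l * t_grid^v, exp(xb)))

# Population-average survival curve
S_pop <- rowMeans(S_ind)

head(data.frame(time = t_grid, surv = S_pop))
```

This averaged curve is the quantity that could be compared against the Kaplan-Meier estimate for the population.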

Now that you understand the basis of these types of simulations, you might be able to take advantage of published, vetted tools to make your life easier. The simsurv package is one example.

Related Solutions

Solved – How to simulate survival times using true base line hazard function

First: you can sample directly from any survival function $S(t)$, which gives the probability of surviving to time $t$ or longer. The way to do this is to generate uniform random variables $u$ as quantiles and find $S^{-1}(u)$. This can be done analytically; below I have an example of how to do it numerically on a pseudo-continuous or discrete time grid using colSums(outer(x,y,'<')), which beats quantile by many flops.

Second: the survival function is related to the hazard via $S(t) = \exp(-\Lambda(t))$, where $\Lambda(t) = \int_{0}^{t} \lambda(s)\, ds$ is called the cumulative hazard function.
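Putting the two points together, $S(T) = u$ is the same as $\Lambda(T) = -\log(u)$, so when the cumulative hazard has a formula but no convenient inverse, a root finder can do the inversion. Here is a small sketch with a made-up cumulative hazard, chosen to be quadratic only so that the numerical answer can be checked against the closed form.

```r
set.seed(3)

# Hypothetical cumulative hazard: Lambda(t) = 0.1 * t + 0.05 * t^2
cum_haz <- function(t) 0.1 * t + 0.05 * t^2

# Uniform draws and the cumulative hazard values they correspond to
u <- runif(5)
target <- -log(u)

# Solve Lambda(t) = -log(u) numerically for each draw
t_num <- vapply(target, function(h) {
  uniroot(function(t) cum_haz(t) - h,
          lower = 0, upper = 100, tol = 1e-10)$root
}, numeric(1))

# Closed-form solution of 0.05 * t^2 + 0.1 * t - h = 0, for comparison
t_exact <- (-0.1 + sqrt(0.1^2 + 4 * 0.05 * target)) / (2 * 0.05)

cbind(t_num, t_exact)
```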

So for simplicity let's sample just from the baseline hazard function, omitting any influence of covariates. As a note, the influence of covariates can be added back by generating survival curves for each individual in the sample by multiplying the hazard function by their exponentiated linear predictor. The cumulative hazard could be found analytically, but a numerical approach with a range of possible failure times is given by:

tdom <- seq(0, 5, by=0.01)
haz <- rep(0, length(tdom))
haz[tdom <= 1] <- exp(-0.3*tdom[tdom <= 1])
haz[tdom > 1 & tdom <= 2.5] <- exp(-0.3)
haz[tdom > 2.5] <- exp(0.3*(tdom[tdom > 2.5] - 3.5))
cumhaz <- cumsum(haz*0.01)
Surv <- exp(-cumhaz)
par(mfrow=c(3,1))
plot(tdom, haz, type='l', xlab='Time domain', ylab='Hazard')
plot(tdom, cumhaz, type='l', xlab='Time domain', ylab='Cumulative hazard')
plot(tdom, Surv, type='l', xlab='Time domain', ylab='Survival')

# generate 100 random samples:
u <- runif(100)
# count the grid points with S(t) > u; pmax() avoids a zero index when u > Surv[1]
failtimes <- tdom[pmax(colSums(outer(Surv, u, `>`)), 1)]

dev.off()
library(survival)
plot(survfit(Surv(failtimes)~1))

This gives a Kaplan-Meier plot of the simulated failure times (figure not shown).
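The covariate adjustment mentioned before the code can be sketched the same way: multiplying the hazard by $\exp(\text{LP})$ multiplies the cumulative hazard by the same factor, so each individual gets their own survival curve on the grid and their own uniform draw. The constant baseline hazard of 0.5 and the three linear predictors below are assumptions for illustration, not values from the question.

```r
set.seed(1)

# Time grid and a hypothetical constant baseline hazard of 0.5
tdom <- seq(0, 5, by = 0.01)
cumhaz <- cumsum(rep(0.5, length(tdom)) * 0.01)

# Hypothetical linear predictors, one per individual
lp <- c(-0.5, 0, 0.5)

# Rows are times, columns are individuals: S_i(t) = exp(-cumhaz(t) * exp(lp_i))
S_i <- exp(-outer(cumhaz, exp(lp)))

# One uniform draw per individual; count grid points where S_i(t) > u_i
u <- runif(length(lp))
idx <- colSums(sweep(S_i, 2, u, `>`))

# Guard against a zero index; draws that never cross u are censored at max(tdom)
failtimes <- tdom[pmax(idx, 1)]
status <- as.integer(idx < length(tdom))

data.frame(lp, failtimes, status)
```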

Solved – How should I interpret the exp(coef) hazard ratio in Cox regression

Alright, a couple things.

First: The hazard is defined as the instantaneous rate of the event at time t, conditional on it not having occurred before t.

So yes, the hazard ratio is a ratio of hazards - in your case, Hazard(Radiation=Yes)/Hazard(Radiation=No).

That ratio is all you need to know. It indicates that someone who receives radiation has about 0.59 times the hazard of someone who doesn't; that is, they are less likely to have the event at any given time (and, by extension, more likely to survive, if your event is 'died of cancer'). If you're familiar with odds ratios or relative risks, you can think about these in broadly the same way for interpretation.

For a single binary exposure like the one you are describing, which level R reports doesn't really change things. The HR for Radiation=No is just 1/0.5882 = 1.70, which has the same interpretation with the focus reversed: people who don't receive radiation have about 1.7 times the hazard of those who do.
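If you want to convince yourself of the reciprocal relationship, here is a small sketch using the lung data that ships with the survival package rather than your radiation data: fitting the same binary covariate with each level as the reference gives hazard ratios that are reciprocals of each other. The variable names sex_f and sex_m are made up for this example.

```r
library(survival)

# Copy of the lung data with sex as a labelled factor (1 = male, 2 = female)
lung2 <- lung
lung2$sex_f <- factor(lung2$sex, levels = c(1, 2), labels = c("male", "female"))

# Reference level "male": the HR compares female with male
fit_a <- coxph(Surv(time, status) ~ sex_f, data = lung2)

# Same covariate with "female" as reference: the HR compares male with female
lung2$sex_m <- relevel(lung2$sex_f, ref = "female")
fit_b <- coxph(Surv(time, status) ~ sex_m, data = lung2)

hr_a <- exp(coef(fit_a))
hr_b <- exp(coef(fit_b))

# The two hazard ratios should be reciprocals
c(hr_a = unname(hr_a), one_over_hr_b = 1 / unname(hr_b))
```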
