Survival – Simulating Survival Times and Accounting for Population Mean Linear Predictor

cox-modelkaplan-meiersimulationsurvival

I have been using the methods outlined in: Bender, Ralf, Thomas Augustin, and Maria Blettner, "Generating survival times to simulate Cox proportional hazards models," Statist. Med. 24: 1713–1723 (2005) to generate survival times from a dataset.

After building the Cox model, I specify a 'baseline patient' and use survfit to estimate the cumulative baseline hazard function:

tmpdf <- data.frame(
  age = mean(pdata$age),
  o2_listing = 0,
  group = "A",
  func_status = median(pdata$func_status),
  bmi_listing = mean(pdata$bmi),
  pa_mean = mean(pdata$pa_mean),
  reg_height = mean(pdata$reg_height),
  pcw = mean(pdata$pcw),
  ventilator = "N",
  pros_inhale = 0,
  egfr = mean(pdata$egfr),
  gender = "F"
)

rc <- cph()  # Model excluded for this post

sv <- survfit(rc, newdata = tmpdf)

I then use nls to calculate the shape (v) and scale(l) parameters from the cumulative baseline hazard function:

c_haz <- sv$cumhaz
sv_t <- sv$time

fit <- nls(c_haz ~ l * sv_t^v, algorithm="port", start=list(l=1, v=1))
l <- coef(fit)[1]
v <- coef(fit)[2]

I can then use the techniques from (R. Bender 2005) to generate random survival times for each patient in the dataset. The calculation involves each patient's linear predictor (sometimes referred to as xb or xbeta in the code).

I can generate survival times by plugging each patient's xbeta into the formula, however when comparing the real survival curve (black) to the simulated curve (blue) you can see the simulation overestimates survival:

Simulated survival times using only patient XBeta

I can also calculate the xbeta for the reference patient:

reference_xb <- unname(predict(rc, newdata=tmpdf, type="lp"))

I can then subtract this reference_xb from each patient's xbeta, however this results in underestimating the survival:

Subtracting the reference patient's XBeta from each individual patient's XBeta results in underestimating survival

However, if I take each patient's xbeta, subtract the reference patient's reference_xb and also subtract the population mean xbeta the curve fits very well:

Much improved fit

I have two main questions:

1 – Is my reasoning/methodology for generating survival times correct? (i.e. will the survival times generated from simulation be representative of the population being simulated?)

2 – I understand why it is necessary to subtract the reference patient's xbeta, but why is it necessary to also subtract the mean xbeta of the population?

Best Answer

I suspect that your problem has to do with categorical predictors coded as numeric, perhaps as 0/1. You specify o2_listing and pros_inhale to have numeric values of 0 (not character "0") for your "baseline patient." If those predictors are coded as numeric as implied, then the values assumed by coxph() and survfit() as their references for linear-predictor calculations and for the baseline survival curve will not be 0, but instead the numeric mean of each. You then need to take the population mean linear predictor into account.

One of the first examples in the main survival vignette can illustrate this problem.

library(survival)
cfit1 <- coxph(Surv(time, status) ~ age + sex + wt.loss, data=lung)

In that data set, sex is coded as 1/2 for male/female. The baseline survival curve and reference for linear-predictor calculations is at a sex value of 1.395, clearly non-binary.

A warning: even then, there's no assurance that you will match the Kaplan-Meier curve you display for "real_pop" with a curve estimated from mean covariate values. Although the linear predictors are linear functions of covariate values, the individual survival curves are doubly nonlinear from there. The survival at time $t$ for a given linear predictor value $\text{LP}$ is $S(t|\text{LP})= S_0(t)^{\exp(\text{LP})}$, where $S_0(t)$ is the baseline survival function. Insofar as the Kaplan-Meier curve can be thought of as the average survival curve over the population, there's no assurance that the mean of those survival curves will coincide with the survival curve estimated from mean covariate values.

It seems that you are doing something more sophisticated than that. As best as I can tell, it seems that you are getting a large number of simulations based on each individual's covariate values to get a survival curve for that individual, then averaging over all the individuals. That should be fine, if perhaps more work than is necessary. As you are fitting the overall model to a Weibull model, you could do that by just calculating the Weibull survival function for each individual without any random sampling. For things like estimating sample-size requirements for complex prospective studies, however, you will need to do random sampling at appropriate overall sample sizes.

Now that you are understanding the basis of these types of simulations, you might be able to take advantage of published, vetted tools to make your life easier. The simsurv package is one example.

Related Question