Survival – How to Introduce Uncertainty in Fitting Original Data When Simulating Survival Curves

rsimulationstandard errorsurvival

The code below simulates the uncertainty in the lognormal distribution parameters, using the MASS::mvrnorm() function and the lung dataset from the survival package. Although the parametric distribution providing the best fit is Weibull, for illustrative purposes I'm using the lognormal distribution.

When running the code, per the image at the bottom of this post, the solid green line shows the Kaplan-Meier curve (probabilities) of the lung data, the dashed-green lines the confidence interval surrounding the K-M probabilities, the red line the fitted survival curve for lung data using the lognormal distribution, and the dashed-blue lines show 5 simulation runs.

My question is, how could I introduce the inherent uncertainty in fitting the original data when running the simulation? In addition to the uncertainty currently simulated of the lognormal parameters. Note in the image the width of the 95% confidence intervals around the K-M curve. It seems that the simulation runs (dashed blue lines) should at least be as wide around the fitted lognormal survival curve (red line) as the 95% CI lines around the K-M curve.

Code:

library(MASS)
library(survival)

fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "lognormal")

time <- seq(0, 1000, by = 1)
meanlog <- fit$coef  # mean on the log scale
sdlog <- fit$scale  # standard deviation on the log scale
var_cov <- vcov(fit) #  extract the variance-covariance matrix

# Compute the lognormal survival function
survival <- 1 - plnorm(time, meanlog = meanlog, sdlog = sdlog)

num_simulations <- 5

# Generate random lognormal parameter estimates for simulations
sim_params <- MASS::mvrnorm(num_simulations, mu = c(meanlog, sdlog), Sigma = var_cov)

# Compute the survival curves for each simulation
sim_curves <- sapply(1:num_simulations, function(i) 1 - plnorm(time, meanlog = sim_params[i, 1], sdlog = sim_params[i, 2]))

# Compute the Kaplan-Meier survival curve for the lung dataset
lung_surv <- survfit(Surv(time, status) ~ 1, data = lung)

# Plot the lognormal survival curve, simulation lines, and Kaplan-Meier plot
plot(time, survival, type = "l", xlab = "Time", ylab = "Survival Probability", 
     main = "Lognormal Survival Curve of Lung Dataset", col = "red", lwd = 2)
lapply(1:num_simulations, function(i) lines(time, sim_curves[, i], col = "blue", lty = "dashed"))
lines(lung_surv, col = "green")

# Store the coordinates of the simulation lines
sim_lines <- lapply(1:num_simulations, function(i) {
  curve <- sim_curves[, i]
  lines(time, curve, col = "blue", lty = "dashed")
  return(data.frame(time = time, survival = curve))
})

Output of the above code:
enter image description here

Best Answer

If you have some parametric function $F$ for a survival curve that predicts the time $T$ of some event

$$P(T\leq t | \theta_1,\theta_2) = F(t;\theta_1,\theta_2)$$

and the parameters themselves are random as well

$$\boldsymbol{\theta} \sim MVN(\boldsymbol{\mu}, \boldsymbol{\sigma})$$

then you can express the probability by integrating over all cases:

$$P(T\leq t) = \iint_{\forall \theta_1,\theta_2} P(T\leq t | \theta_1,\theta_2) f(\theta_1,\theta_2) d\theta_1 d\theta_2$$

Possibly this might be evaluated analytically or approximated. Your question seems to use the approach of simulations. In that case the result is the average of your simulated curves.


Computational example:

Let the waiting time for the event be exponential distributed

$$T \sim Exp(\lambda)$$

with a variable rate $\lambda$

$$\lambda \sim N(1,0.04)$$

The the survival curve can be computed as the average

$$S(t) = E[exp(-\lambda t)]$$

and this follows a log normal distribution (approximately because the case here is truncated at zero) with mean parameter $-\mu t$ and scale parameter $\sigma t$ thus we have

$$S(t) \approx exp(-\mu t + 0.5 \sigma^2 t^2)$$

(the formula brakes down for large $\sigma$ or $t$ when the approximation of the truncated distribution with a non-truncated distribution fails).

The simulation below shows that this approximation can work

example

set.seed(1)

### generate data from exponential distribution
### with variable rate
n = 10^5
lambda = rnorm(n,1,0.2)
t = rexp(n,lambda)

### order data for plotting as
### emperical survival curve
t = t[order(t)]
p = c(1:n)/n

### plotting
plot(t, 1-p, ylab = "P(T<=t)", main = "emperical survival curve \n t ~ exp(lambda)\n lambda ~ N(1,0.04)", type = "l", log = "y")

### compare two models
lines(t,exp(-t+0.2^2/2*t^2), col = 4, lty = 2)
lines(t,exp(-(1)*t), col = 2, lty = 2)