Survival – Modeling Uncertainty in Exponential Distribution for Survival Simulations

bootstrapexponential distributionrsimulationsurvival

In the code shown at the bottom of this post, I plot survival curves for the lung dataset from the survival package using a fitted exponential model, using the K-M nonparametric model, and run/show simulations using the exponential model.

I use bootstrapping, resampling from the original data with replacement to create multiple bootstrap samples using sample(). For each bootstrap sample, the code fits the exponential distribution using the survreg() function. This process is repeated, generating a distribution of estimates, representing the variability and uncertainty of the exponential statistical model.

My objective with this ultimately is given a partial survival curve (say 500 periods of the lung dataset), generating conservative simulations for periods 501-1000. I don't show that in this code example. When drafting similar code for the Weibull distribution, I use both bootstrapping (with sample() function) and additionally simulated uncertainty of the Weibull parameters using MASS:mvrnorm(), to derive a nicely dispersed range of simulation outcomes.

However, in this exponential model example, the exponential distribution has only one parameter, the rate (λ) parameter; so MASS:mvrnorm() makes no sense in this case. To introduce more dispersion in outcomes in the below code I use rnorm(1, mean = 0, sd = 0.05) in the sim_params section (all commented out in the code and in the below illustration to not introduce this additional uncertainty factor), which as the code is currently drafted is subjective (by manually inputting the SDEV value) and not grounded in the actual data unlike my use of MASS:mvrnorm() for the Weibull distribution.

So my questions are (1) is there a way to ground this parameter uncertainty factor (sim_params...) in the actual lung data? and (2) is this method of modeling uncertainty both using bootstrapping with sample() and modeling uncertainty in the distribution parameters themselves (in the sim_params section) theoretically valid?

The image below only shows the results of running the code with only bootstrap resampling functioning, and showing a run of 2000 simulations:

enter image description here

Code:

library(survival)

num_simulations <- 2000

# Fit the exponential model to the dataset
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")

time <- seq(0, 1000, by = 1)

# Compute the exponential survival function using fitted model
survival <- 1 - pexp(time, rate = 1 / exp(fit$coef))

# Generate bootstrap samples and fit exponential models to each sample
bootstrap_fits <- lapply(1:num_simulations, function(i) {
  sample_data <- lung[sample(nrow(lung), replace = TRUE), ]
  fit <- survreg(Surv(time, status) ~ 1, data = sample_data, dist = "exponential")
  return(fit)
})

# Generate random distribution parameter estimates for simulations
sim_params <- sapply(bootstrap_fits, function(fit) {
  rate <- fit$coef 
  params <- rate # this is a bypass of "perturbation" below
  # perturbation <- rnorm(1, mean = 0, sd = 0.05)  # Adjust sd for simulation dispersion
  # perturbed_rate <- rate + perturbation
  # params <- perturbed_rate
  return(params)
})

# Compute the survival curves for each simulation using the sampled parameters
sim_curves <- sapply(
  1:num_simulations, 
  function(i) 1 - pexp(time, rate = 1 / exp(sim_params[i]))
)

plot(time, survival, type = "n", xlab = "Time", ylab = "Survival Probability", 
     main = "Survival Plot of Lung Dataset")

sim_lines <- data.frame(
  time = time, 
  do.call(cbind, lapply(1:num_simulations, function(i) {
    curve <- sim_curves[, i]
    lines(time, curve, col = "lightblue", lty = "solid", lwd = 0.25)
    return(curve)
  })))

colnames(sim_lines)[-1] <- paste0("surv", 1:num_simulations) 

# Compute and add to the plot the Kaplan-Meier survival curve for the dataset
lines(survfit(Surv(time, status) ~ 1, data = lung), col = "blue", lwd = 1)

# Plot the exponential survival curve
lines(time, survival, type = "l", xlab = "Time", ylab = "Survival Probability", col = "red", lwd = 3)

legend("topright", 
       legend = c("Fitted exponential model", 
                  "Kaplan-Meier & confidence intervals", 
                  "Simulations"), 
       col = c("red", "blue", "lightblue"),
       lwd = c(3, 1, 0.25),
       lty = c(1, 1, 1), # 1 = solid, 2 = dashed
       bty = "n")

Best Answer

You have to separate out some different types of "uncertainty" here.

The models you fit take the form:

$$\log(T)\sim \beta_0 + W, $$

where $\beta_0$ is your fit$coef and $W$ represents a standard minimum extreme value distribution.

From the perspective of individuals modeled this way, the distribution of $W$ represents a major source of uncertainty. Even if you know $\beta_0$ exactly, the event times among individuals will have a wide distribution, in this case following (in the log scale of time) a standard minimum extreme value distribution.

The next source of uncertainty is in your estimate of $\beta_0$. Under the theory of fitting such a model via maximum likelihood, the estimate of $\beta_0$ has an asymptotically normal distribution. In this situation with only one coefficient to estimate, that's just a simple case of the more general multivariate normal distribution of multiple coefficient estimates. You can get that asymptotic normal estimate directly from the first exponential fit to the lung data set similarly to how you would with more complicated models

fit$icoef
# (Intercept) 
#   6.044474  
sqrt(vcov(fit))
#             (Intercept)
# (Intercept)  0.07784989

and sample in this situation from the corresponding one-dimensional normal distribution. That sampling should be done in this scale of coefficient estimates before you do any transformation to exponential rate values. The data themselves thus provide this estimate of uncertainty in modeling. Your subjective choice of a standard deviation value is unnecessary.

Bootstrapping provides a different estimate of uncertainty in modeling, by repeating the modeling on multiple bootstrap samples of the full original data set. Among other things, that can provide a check on how well the assumption of asymptotic normality of the original coefficient estimates holds. Ideally, the distribution of coefficient estimates among fits to bootstrap samples should be similar to the normal distribution estimated in the original model.

Bootstrapping also can be used to estimate the "optimism" in coefficient estimates due to overfitting and to generate optimism-corrected calibration curves. See the validate() and calibrate() functions of the rms package in R.

If your ultimate interest is in the uncertainty of event times among individuals, however, then you also must consider the fundamental variability imposed by the underlying minimum extreme value distribution. In practice, that typically overwhelms the variability in the estimates of $\beta_0$.

Here's an example of how little variability in a coefficient estimate can matter. Here are the distributions of 300 individual log-survival times drawn from each of the following exponential distributions: at the point estimate of your rate, and at rates equivalent to its upper and lower 95% limits (see note *** below).

log-survival time distributions

You could make the same point analytically, but this has the advantage of also displaying the sampling variability given the specified distributions. The differences associated with the error in the coefficient estimates are essentially lost among the overall widths of the distributions.

*** These simulations, as illustrated in this image and per the code below, are of event times, not model parameters. Technically, this is not sampling directly from a standard minimum extreme value distribution $W$; this example takes advantage of the simplicity of the exponential model (with scale $σ$ in the term $σW$ fixed at 1) to sample directly from an exponential survival distribution with a fixed rate parameter. This following link shows a correct way to sample from a minimum extreme value distribution for this general type of parametric survival model: Simulate a Weibull regression model

Code:

set.seed(203)
Point_Est <- rexp(300, rate=1/exp(6.044474))
LCL_Est <- rexp(300, rate=1/exp(6.044474+1.96*0.07784989))
UCL_Est <- rexp(300, rate=1/exp(6.044474-1.96*0.07784989))
plot(density(log(UCL_Est)), col="red", bty="n",
            xlab="log(Survival Time)", ylab="density", 
            main="Survival time distributions")
lines(density(log(Point_Est)), col="black")
lines(density(log(LCL_Est)) ,col="blue")
legend("topleft", bty="n", legend = 
"Black, point estimate\nRed, 95% upper limit for rate\nBlue, 95% lower limit for rate")