Survival Analysis – How to Simulate an Extreme Value Distribution Correctly

exponential distributionextreme valuersimulationsurvival

In the image and per the code at the bottom of this post, I plot survival curves for the lung dataset from the survival package using a fitted exponential distribution model (plot red line), using the K-M nonparametric model (plot blue lines), and run/show simulations using the exponential model (plot light blue lines) with the mean of the simulations shown as the black line. Exponential doesn't provide the best fit for lung data but I'm trying to better understand modeling with exponential and extreme value distributions.

The model fit takes the form $log(T)$$β_0 + W$ where $β_0$ is the fit$coef and $W$ represents a standard minimum extreme value distribution (exponential in this case). In the example presented herein I only model $W$; in related posts I address the modeling of $β_0$ such as in How to appropriately model the uncertainty of the exponential distribution model when running survival simulations?. The simulations herein are, for a change, from the perspective of individuals, modeled by generating random intercepts for the exponential distribution in the line of code simPaths <- sapply(1:simNbr,function(i) 1-pexp(time,rate=1/rexp(1,rate = 1/exp(fit$icoef)))).

My question is why does the mean of the simulations shown in the black line differ so widely differ from the fitted base exponential model shown in the red line? When I simulated only the $β_0$ uncertainty in related posts, the simulations formed a band around the fitted base distribution similar in width more or less to the 95% CI around the Kaplan-Meier curve (the dashed blue lines). What am I doing wrong?

enter image description here

Code:

library(survival)

simNbr <- 25

time <- seq(0, 1000, by = 1)

fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")

# Compute exponential survival function for the base fitted model
survival <- 1 - pexp(time, rate = 1/exp(fit$coef))

# Compute the survival curves for each simulation
simPaths <- sapply(1:simNbr,function(i) 1-pexp(time,rate=1/rexp(1,rate = 1/exp(fit$icoef))))

plot(time,survival,type="n",xlab="Time",ylab="Survival Probability",main="Lung Data Survival Plot")

# Plot simulations
plotSims <- data.frame(time = time, 
  do.call(cbind, 
          lapply(1:simNbr,function(i) {
                 lines(time, simPaths[, i], col = "lightblue", lty = "solid", lwd = 0.25)
                 return(curve)
                }
          )
        )
)

# Add average of simulations
avgSurvival <- apply(simPaths, 1, mean)
lines(time, avgSurvival, col = "black", lwd = 3)

# Add Kaplan-Meier survival curve for the lung data
lines(survfit(Surv(time, status) ~ 1, data = lung), col = "blue", lwd = 1)

# Plot the base fitted survival curve using exponential
lines(cbind(time, survival), type = "l", col = "red", lwd = 3)

legend("topright", 
       legend = c("Fitted exponential model","K-M & confidence intervals","Simulations", "Simulation mean"), 
       col = c("red", "blue", "lightblue", "black"),lwd = c(3, 1, 0.25, 3),lty = c(1, 1, 1, 1), 
       bty = "n")

Best Answer

I studied Simulate a Weibull regression model and my key takeaway is this (as modified slightly by me since I am trying to model $W$ in the model equation described in the OP):

W <- log(rexp(1000))
survreg(Surv(exp(W))~1,dist="exponential")

Running the above results in an intercept for $β_0$ near 0. Applying this method to my OP question and code, I get the output illustrated below with the code at the bottom. The code section that adopts the above is in simParams: W <- log(rexp(100)), fit <- survreg(Surv(exp(W))~1,dist="exponential"), and params <- coefLung + fit$icoef.

Though this is visually pleasing to my novice eye, a doubt I have is in rexp(100), where I set the 100 arbitrarily. A greater number of samples results in less dispersion, and a lesser number of samples the opposite. Is there an accepted method for setting the number of samples? Perhaps I should have used the number of elements (228) in the lung dataset? Maybe this is better addressed in another post.

enter image description here

Code:

simNbr <- 1000
time <- seq(0, 1000, by = 1)
fitLung <- survreg(Surv(time, status) ~ 1, data = lung, dist = "exponential")
coefLung <- fitLung$icoef

# Compute exponential survival function for the base fitted model
survival <- 1 - pexp(time, rate = 1/exp(coefLung))

# Generate random distribution parameter estimates for simulations
simParams <- sapply(
  1:simNbr,
  function(i){
    W <- log(rexp(100)) # note the number of random values which has a large impact on dispersion
    fit <- survreg(Surv(exp(W))~1,dist="exponential")
    params <- coefLung + fit$icoef
    return(as.vector(params))
    }
)

# Compute the survival curve for each simulation
simPaths <- sapply(
  1:simNbr, 
  function(i) 1 - pexp(time, rate = 1 / exp(simParams[i]))
)

# Set up plot shell
plot(time,survival,type="n",xlab="Time",ylab="Survival Probability",main="Lung Data Survival Plot")

# Plot simulations
plotSims <- data.frame(
  time = time, 
  do.call(cbind, 
          lapply(1:simNbr,function(i) {
            lines(time, simPaths[, i], col = "lightblue", lty = "solid", lwd = 0.25)
            return(curve)
            }
           )
         )
)

# Add average of simulations
avgSurvival <- apply(simPaths, 1, mean)
lines(time, avgSurvival, col = "black", lwd = 1)

# Add Kaplan-Meier survival curve for the lung data
lines(survfit(Surv(time, status) ~ 1, data = lung), col = "blue", lwd = 1)

# Plot the base fitted survival curve using exponential
lines(cbind(time, survival), type = "l", col = "red", lwd = 3)

legend("topright", 
       legend = c("Fitted exponential model","K-M & confidence intervals","Simulations", "Simulations mean"), 
       col = c("red", "blue", "lightblue", "black"),lwd = c(3, 1, 0.25, 3),lty = c(1, 1, 1, 1), 
       bty = "n"
       )