R Regression – Simulating Variability in Fitting Gamma Model to Survival Data Using Generalized Minimum Extreme Value Distribution

extreme valuerregressionsimulationsurvival

As shown below and per the R code at the bottom, I plot a base survival curve for the lung dataset from the survival package using a fitted gamma model (plot red line) with flexsurvreg() and run simulations (plot blue lines) by drawing on random samples for the $W$ random error term of the standard gamma distribution equation $Y=logT=α+W$ per section 1.5 of the Rodríguez notes at https://grodri.github.io/survival/ParametricSurvival.pdf. In drawing on random samples for $W$, I am trying to illustrate the fundamental variability imposed by fitting a model to the data by using a generalized standard minimum extreme value distribution per the Rodríguez notes, which states: "$W$ has a generalized extreme-value distribution with density…controlled by a parameter $k$ as excerpted below"

enter image description here

In the plot below, I run 5 simulations by repeatedly running the last line of code and as you can see the simulations are far from the fitted gamma curve in red. Those simulations should surround the fitted gamma, loosely following its shape. The best I've been able to do in generating random samples $W$ and loosely following the shape of the fitted gamma curve is by using rgamma() as shown in the below code and plotted below, and for getting closer to the fitted gamma curve although not following its curvature is rexp() (commented-out in the code) which is a bit nonsensical and neither of these two are generalized extreme values either. I also try rgpd() from the evd package (commented-out) for generating random samples from a generalized extreme value distribution, and those results are nonsensical. Similar bad outcomes using evd::rgev().

What am I doing wrong in my interpretation of sampling from the generalized extreme-value distribution? And how do I get the simulations to form a broad band around the gamma curve?

enter image description here

Code:

library(evd)
library(flexsurv)
library(survival)

time <- seq(0, 1000, by = 1)
fit <- flexsurvreg(Surv(time, status) ~ 1, data = lung, dist = "gamma")
shape <- exp(fit$coef["shape"])
rate <- exp(fit$coef["rate"])

survival <- 1-pgamma(time, shape = shape, rate = rate)

# Generate random distribution parameter estimates for simulations
simFX <- function(){
    W <- log(rgamma(100,shape = shape, scale = 1/rate))
    # W <- log(rexp(100, rate))
    # w <- log(evd::rgpd(100,shape = shape, scale = 1/rate))
    newTimes <- exp(log(shape) + W)
    newFit <- flexsurvreg(Surv(newTimes) ~ 1, data = lung, dist = "gamma")
    newFitShape <- exp(newFit$coef["shape"])
    newFitRate <- exp(newFit$coef["rate"])
    return(1-pgamma(time, shape=newFitShape, rate=newFitRate))
  }

plot(time,survival,type="n",xlab="Time",ylab="Survival Probability", main="Lung Survival (gamma)")

lines(survival, type = "l", col = "red", lwd = 3) # plot base fitted survival curve
lines(simFX(), col = "blue", lty = 2) # run this line repeatedly for SIMULATION

Edit: [There was previously an Edit 1 reflecting an error and Edit 2 correcting Edit 1. Edit 1 has been deleted and Edit 2 switched to the final "Edit"]. This Edit adds a Weibull example showing the simulation of $W$ when characterizing Weibull by $Y = log T = α + σW$,
where $W$ has the extreme value distribution, $α = − logλ$ and $p = 1/σ$. This example generates samples by: (1) randomizing $W$, (2) combining the randomized $W$ with the estimates of $α$ and $σ$ from the model fit to the lung data following the above equation form, (3) exponentiating the last item per the above equation form to get survival-time simulations (note in newTimes <- exp(fit$icoef[1] + exp(fit$icoef[2])*W) of the code below, how $σ$ (fit$icoef[2]) is exponentiated since in its native form it is in the log scale) and the whole thing $α+σW$ is exponentiated again as a unit), (4) running each new survreg() fit. Code:

library(survival)
time <- seq(0, 1000, by = 1)
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")

weibCurve <- function(time, survregCoefs){exp(-(time/exp(survregCoefs[1]))^exp(-survregCoefs[2]))}
survival <- weibCurve(time, fit$icoef)
simFX <- function(){
  W <- log(rexp(165))
  newTimes <- exp(fit$icoef[1] + exp(fit$icoef[2])*W)
  newFit <- survreg(Surv(newTimes)~1,dist="weibull")
  params <- c(newFit$icoef[1],newFit$icoef[2])
  return(weibCurve(time, params))
}
plot(time,survival,type="n",xlab="Time",ylab="Survival Probability", 
     main="Lung Survival (Weibull) by sampling from W extreme-value distribution")
invisible(replicate(500,lines(simFX(), col = "blue", lty = 2))) # run this line to add simulations to plot
lines(survival, type = "l", col = "yellow", lwd = 3) # plot base fitted survival curve

Illustration of running above simFX() 500 times (using replicate() function):

enter image description here

Best Answer

Ultimately, if you have the same reasonably large number of events and the underlying parametric model is adequate, the distribution of coefficient estimates from repeated sampling of the underlying distribution should be close to the multivariate normal distribution from the original fit.

There's no need to put this in the $\log T = \alpha + W$ accelerated failure time format. Once you have fixed shape and rate parameters in the form expected by pgamma() and rgamma(), just sample directly from rgamma() for event-time simulations. With your shape and rate values from the fit to the lung data set, to mimic the 165 events in that data set and overlay onto the Kaplan-Meier curve:

plot(survfit(Surv(time, status) ~ 1, data = lung))
newGammaData <- rgamma(165,shape=shape,rate=rate)
newGammaFit <- flexsurvreg(Surv(newGammaData, rep(1,165))~1,dist="gamma") 
lines(1-pgamma(time, exp(newGammaFit$coef["shape"]), exp(newGammaFit$coef["rate"])),col="red")

Repeat the last 3 lines as often as you'd like.

To compare against the variability of coefficient estimates from the original model, sample from the bivariate normal distribution of the coefficients as they are reported by flexsurvreg(), exponentiate (as you do) to get values that are expected by pgamma(), then add lines as you show. Try the following:

newCoef <- MASS::mvrnorm(1,coef(fit),vcov(fit))
lines(1-pgamma(time,exp(newCoef[["shape"]]),exp(newCoef[["rate"]])),col="blue")

and repeat both code lines as often as you would like.

Direct comparison of covariances among fits to simulated data and covariance matrix of coefficient estimates

To illustrate how well the asymptotically bivariate normal distribution of coefficient estimates corresponds to what you find by multiple fits to data drawn from the modeled survival distribution, get the gamma fit to the original data (with 165 events), then combine fit results from 200 samples (of 165 uncensored simulated event times each) from the modeled gamma fit.

## fit lung data
library(flexsurv)
lungGammaFit <- flexsurvreg(Surv(time, status) ~ 1, data = lung, dist = "gamma")
## get coefficients in form for rgamma()
shape <- exp(lungGammaFit$coef["shape"])
rate <- exp(lungGammaFit$coef["rate"])
## initialize for accumulation over reps
bar <- rep(0,2) ## to keep coefficient estimates
cov <- matrix(0,nrow=2,ncol=2) ## to keep covariances
set.seed(104)
## do 200 samples of 165 each and fit
for (fitCount in 1:200) {newData <- rgamma(165,shape=shape,rate=rate);
      newFit <- flexsurvreg(Surv(newData, rep(1,165))~1,dist="gamma");
      cof<-coef(newFit);
      bar <- bar+cof;
      cof<-as.matrix(cof);
      cov<-cov+cof%*%t(cof)}
## take average over 200 reps
(barAve <- bar/200)
#     shape       rate 
# 0.4001778 -5.5644051 
## expected good agreement with original fit
coef(lungGammaFit)
#     shape       rate 
# 0.3908883 -5.5839805
## multivariate coefficient covariance estimate
barAve <- as.matrix(barAve)
newVcov <- (cov-200*barAve %*% t(barAve))/199
newVcov
#            shape       rate
# shape 0.01090990 0.01145029
# rate  0.01145029 0.01592581
## even covariances are very close
vcov(lungGammaFit)
#             shape       rate
# shape 0.009105897 0.01050332
# rate  0.010503320 0.01603337

That illustrates the first paragraph of the answer: the variability among fits to multiple samples from the gamma-distribution fit to the lung data is essentially what you would get from the original variance-covariance matrix of coefficient estimates, if the number of events is the same.