R Survival Analysis – Why Best Fitting Weibull Distribution Deviates from Actual Data

gumbel distribution, r, survival, weibull distribution

I started working with the Gumbel distribution and fit it to the lung dataset to try it out. I then compared it with the survival curve using the Weibull distribution, which provides the best fit per goodness-of-fit tests and also hews closely to the Kaplan-Meier plot as shown below.

When I compute the death rate in the lung data (status = 2 indicates death; the number of status 2 records divided by the 228 records in lung), I get 72.4%. The Gumbel fit gives a death rate of 72.7% (see death_rate_Gumbel in the code below), while the Weibull fit gives 63.2% (see death_rate_Weibull below). Shouldn't the Weibull death rate be close to the actual death rate for the lung dataset? What am I doing wrong or misinterpreting?

Code:

library(evd)
library(fitdistrplus)
library(survival)

time <- seq(0, 1022, by = 1)

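# Gumbel distribution (fit to the observed death times only)
deathTime <- lung$time[lung$status == 2]
# method-of-moments starting values: sd = pi*scale/sqrt(6), mean = loc + 0.5772157*scale
scale_est <- (sd(deathTime)*sqrt(6))/pi
loc_est <- mean(deathTime) - 0.5772157*scale_est
# start names must match the arguments of evd::dgumbel (loc, scale)
fitGum <- fitdistrplus::fitdist(deathTime, "gumbel", start = list(loc = loc_est, scale = scale_est))
survGum <- 1-evd::pgumbel(time, fitGum$estimate[1], fitGum$estimate[2])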

# Weibull distribution
survWeib <- function(time, survregCoefs) {exp(-(time / exp(survregCoefs[1]))^exp(-survregCoefs[2]))}
fitWeib <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")

# plot all
plot(time,survGum,type="n",xlab="Time",ylab="Survival Probability", main="Lung Survival")
lines(time, survGum, type = "l", col = "red", lwd = 3) # plot Gumbel
lines(time, survWeib(time, fitWeib$icoef), type = "l", col = "blue", lwd = 3) # plot Weibull
lines(survfit(Surv(time, status) ~ 1, data = lung), col = "black", lwd = 1) # plot K-M
legend("topright",legend = c("Gumbel","Weibull","Kaplan-Meier"),col = c("red", "blue","black"),lwd = c(3,3,1),bty = "n")

# death rates
death_rate_Weibull <- 1-mean(survWeib(time, fitWeib$icoef))
death_rate_Gumbel <- 1-mean(survGum)

Best Answer

Putting EdM's comment into code, here are three options for estimating median survival/death rates:

Use the median formula for the standard Weibull parameterization, $\lambda (\ln 2)^{1/\alpha}$, with scale parameter $\lambda$ and shape parameter $\alpha$, per the answer provided in Why am I not able to correctly calculate the median survival time for the Weibull distribution?
Use quantiles
Use the survival fraction at time $X$
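
For reference, the median formula in the first option follows from setting the standard Weibull survival function equal to one half. This is a short derivation in the same symbols ($\lambda$ for scale, $\alpha$ for shape), not a new result:

$$ S(t) = \exp\left(-\left(\frac{t}{\lambda}\right)^{\alpha}\right) = \frac{1}{2} \;\Longrightarrow\; \left(\frac{t}{\lambda}\right)^{\alpha} = \ln 2 \;\Longrightarrow\; t_{\text{median}} = \lambda (\ln 2)^{1/\alpha}. $$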

Code:

### METHOD 1: MEDIAN FORMULA ###
# standard scale = exp(icoef[1]); standard shape = exp(-icoef[2]), so 1/shape = exp(icoef[2])
median_surv <- exp(fitWeib$icoef[1])*(log(2))^exp(fitWeib$icoef[2])
death_rate_Weib <- 1-median_surv/max(lung$time)

### METHOD 2: QUANTILES ###
# median survival times
median_surv_Weib <- qweibull(0.5, shape = exp(-fitWeib$icoef[2]), scale = exp(fitWeib$icoef[1]))
# median death rates
death_rate_Weib <- 1 - median_surv_Weib/max(lung$time)

### METHOD 3: MIDPOINT SURVIVAL ###
# median survival percentage
surv_rate_Weib <- survWeib(max(lung$time)/2, fitWeib$icoef)
surv_rate_Gumb <- 1-evd::pgumbel(max(lung$time)/2, fitGum$estimate[1], fitGum$estimate[2])
# median death percentage
death_rate_Weib <- 1-surv_rate_Weib
death_rate_Gumb <- 1-surv_rate_Gumb
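
As a sanity check on the median calculation, the sketch below computes the Weibull median three ways and prints it next to the Kaplan-Meier median. It is a minimal example that refits the same intercept-only model on lung; the names fit, lambda, alpha and the med_* variables are introduced here for illustration and are not part of the original code.

```r
library(survival)

# Intercept-only Weibull fit to the lung data
fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")

# Convert survreg output to the standard Weibull parameters
lambda <- exp(unname(coef(fit)))  # standard scale = exp(Intercept)
alpha  <- 1 / fit$scale           # standard shape = 1 / survreg scale

# Median survival time computed three ways
med_formula  <- lambda * log(2)^(1 / alpha)
med_quantile <- qweibull(0.5, shape = alpha, scale = lambda)
med_predict  <- unname(predict(fit, type = "quantile", p = 0.5)[1])

# Kaplan-Meier median for comparison
km <- survfit(Surv(time, status) ~ 1, data = lung)
med_km <- unname(summary(km)$table["median"])

print(c(formula = med_formula, qweibull = med_quantile,
        predict = med_predict, KM = med_km))
```

If the three Weibull values agree with each other, the parameter conversion is being applied consistently; how close they sit to the Kaplan-Meier median is then a statement about model fit rather than about the code.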

Related Solutions

Multiple Forecast Simulations – Generating Paths for Survival Analysis

First, as you are using survfit() to fit your lung1 data, your simulations aren't using any information about a Weibull fit to those data. Second, the "standard" Weibull parameterization used by Wikipedia and by dweibull() in R differs from the one used by survreg() or by flexsurvreg(), which you try in another question; that difference is a frequent source of confusion. Third, if you want to get smooth estimates over time, then you have to ask for them. It seems that your simulations here and in related questions ask for some type of point estimate or random sample from the distribution rather than a smooth curve.

Random samples from the event distribution are OK and are used for things like power analysis in complex designs. For your application you would need, however, a lot of random samples from each set of new random Weibull parameters to put together to get the estimated survival curves you want. That's unnecessary, as with a parametric fit (unlike the time-series estimates you've used in other work) there is a simple closed form for the survival curve, providing the basis for the continuous predictions that you want.

In the "standard" parameterization used by Wikipedia and by dweibull() in R, the Weibull survival function is:

$$ S(x) = \exp\left( -\left(\frac{x}{\lambda} \right)^k \right),$$

where $\lambda$ is the standard "scale" and $k$ is the standard "shape."

Neither survreg() nor flexsurvreg() (which calls survreg() for this type of model) fits the model based on that parameterization. Although flexsurvreg() can report coefficients and standard errors in that parameterization, the internal storage that you access with functions like coef() and vcov() uses a different parameterization.

To get the "standard" scale, you need to exponentiate the linear predictor returned by a fit based on survreg(). If there are no covariates, then that's just exp(Intercept).

To get the "standard" shape, you need to take the inverse of the survreg_scale. The coefficient stored by survreg() or flexsurvreg() is the log of survreg_scale, so you can get the "standard" shape via exp(-log(survreg_scale)).

Further complicating things is that survreg(), unlike flexsurvreg(), doesn't return log(scale) via the coef() function. You can, however, get that along with the other coefficients by asking for model$icoef, which returns all coefficients in the same order that they appear in vcov().
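
The conversions described above can be sketched as follows. This is a minimal example that assumes the built-in lung data as a stand-in for lung1 and a model with no covariates; std_scale and std_shape are names made up for this illustration.

```r
library(survival)

fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")

coef(fit)    # linear-predictor coefficients only: (Intercept)
fit$scale    # survreg's scale, stored separately from coef()
fit$icoef    # for this no-covariate model: intercept together with log(scale)
vcov(fit)    # 2 x 2: rows/columns for (Intercept) and Log(scale)

# "Standard" (dweibull-style) parameters for the no-covariate model
std_scale <- exp(unname(coef(fit)))   # lambda = exp(Intercept)
std_shape <- 1 / fit$scale            # k = 1 / survreg_scale
```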

The following function returns the survival curve for a Weibull fit from survreg(). The survregCoefs argument should be a vector with the first component the linear predictor and the second the log(scale) from survreg().

weibCurve <- function(time, survregCoefs) {
               exp(-(time/exp(survregCoefs[1]))^exp(-survregCoefs[2]))
               }
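
One way to convince yourself that the function matches the standard parameterization is to compare it with pweibull(). The coefficient values below are hypothetical, chosen only for illustration, and are not estimates from any data set.

```r
weibCurve <- function(time, survregCoefs) {
  exp(-(time / exp(survregCoefs[1]))^exp(-survregCoefs[2]))
}

# Hypothetical values: linear predictor 6 and log(scale) -0.3
coefs  <- c(6, -0.3)
t_grid <- c(0, 100, 500, 1000)

s_curve <- weibCurve(t_grid, coefs)
# Same curve from base R, using standard shape and scale
s_base  <- pweibull(t_grid, shape = exp(-coefs[2]), scale = exp(coefs[1]),
                    lower.tail = FALSE)

all.equal(s_curve, s_base)
```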

Fit a Weibull distribution to the data and compare the fit to the raw data:

## fit Weibull
fit1 <- survreg(Surv(time1, status1) ~ 1, data = lung1)
## plot raw data as censored
plot(survfit(Surv(time1, status1) ~ 1, data = lung1),
       xlim = c(0, 1000), ylim = c(0, 1), bty = "n", 
       xlab = "Time", ylab = "Fraction surviving")
## overlay Weibull fit
curve(weibCurve(x, fit1$icoef), from = 0, to = 1000, add = TRUE, col = "red")

Then you can sample from the distribution of coefficient estimates and repeat the following as frequently as you like to see the variability in estimates (assuming that the Weibull model is correct for the data). I set a seed for reproducibility.

set.seed(2423)
## repeat the following as needed to add randomized predictions for late times.
## I did both 5 times to get the posted plot.
newCoef <- MASS::mvrnorm(n = 1, fit1$icoef, vcov(fit1))
curve(weibCurve(x, newCoef), from = 500, to = 1000, add = TRUE, col = "blue", lty = 2)
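
Rather than adding curves one at a time, the same idea can be summarized as a pointwise band. The sketch below is one possible way to do that, assuming the built-in lung data in place of lung1 and an arbitrary choice of 200 draws; mu, draws, t_grid, curves and band are names introduced here.

```r
library(survival)

weibCurve <- function(time, survregCoefs) {
  exp(-(time / exp(survregCoefs[1]))^exp(-survregCoefs[2]))
}

fit <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")
mu  <- c(unname(coef(fit)), log(fit$scale))   # (Intercept), log(scale)

set.seed(2423)
# 200 coefficient vectors from the approximate sampling distribution
draws <- MASS::mvrnorm(n = 200, mu = mu, Sigma = vcov(fit))

t_grid <- seq(500, 1000, by = 10)
# One column per draw, one row per time point
curves <- apply(draws, 1, function(b) weibCurve(t_grid, b))

# Pointwise 2.5% and 97.5% quantiles across draws
band <- apply(curves, 1, quantile, probs = c(0.025, 0.975))
```

The rows of band can be overlaid on an existing plot with lines(t_grid, band[1, ]) and lines(t_grid, band[2, ]). Like the single draws, the band is only meaningful if the Weibull model is appropriate for the data.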

That leads to the following plot.

Another approach to getting the variability of projections into the future from the model is to get a distribution of "remaining useful life" values for multiple random samples of Weibull coefficient values, conditional upon survival to your last observation time (500 here). This page shows the formula.
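
For a single set of Weibull parameters, the conditional calculation can be sketched as below. It relies on the general identity S(t | T > t0) = S(t)/S(t0) applied to the standard Weibull survival function; it is not taken from the linked page, and the parameter values and the names cond_surv and median_remaining are hypothetical.

```r
# Hypothetical standard Weibull parameters, for illustration only
lambda <- 400   # standard scale
k      <- 1.3   # standard shape
t0     <- 500   # last observation time

# Survival to time t, conditional on having survived to t0
cond_surv <- function(t, t0, lambda, k) {
  pweibull(t, shape = k, scale = lambda, lower.tail = FALSE) /
    pweibull(t0, shape = k, scale = lambda, lower.tail = FALSE)
}

# Median remaining life: solve S(t0 + r) / S(t0) = 1/2 for r
median_remaining <- lambda * ((t0 / lambda)^k + log(2))^(1 / k) - t0

# Check: conditional survival at t0 + median_remaining should be one half
cond_surv(t0 + median_remaining, t0, lambda, k)
```

Repeating the calculation over many sampled coefficient vectors (converted to lambda and k) would give a distribution of remaining-life values.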
