R Survival Analysis – Why Best Fitting Weibull Distribution Deviates from Actual Data

gumbel distributionrsurvivalweibull distribution

I started working with the Gumbel distribution and fit it to the lung dataset to try it out. I then compared it with the survival curve using the Weibull distribution, which provides the best fit per goodness-of-fit tests and also hews closely to the Kaplan-Meier plot as shown below.

When averaging the death rate in the lung data (status = 2 is death; all status 2's divided by a total of 228 elements in lung data) the death rate is 72.4%. This compares to a death rate for Gumbel fit of 72.7% (see death_rate_Gumbel in below code) and a death rate for Weibull fit of 63.2% (see death_rate_Weibull below). Shouldn't the Weibull death rate be close to the actual death rate for lung dataset? What am I doing wrong, or misinterpreting?

enter image description here

Code:

library(evd)
library(fitdistrplus)
library(survival)

time <- seq(0, 1022, by = 1)

# Gumbel distribution
deathTime <- lung$time[lung$status == 2]
scale_est <- (sd(deathTime)*sqrt(6))/pi
loc_est <- mean(deathTime) + 0.5772157*scale_est
fitGum <- fitdistrplus::fitdist(deathTime, "gumbel",start=list(a = loc_est, b = scale_est)) 
survGum <- 1-evd::pgumbel(time, fitGum$estimate[1], fitGum$estimate[2])

# Weibull distribution
survWeib <- function(time, survregCoefs) {exp(-(time / exp(survregCoefs[1]))^exp(-survregCoefs[2]))}
fitWeib <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")

# plot all
plot(time,survGum,type="n",xlab="Time",ylab="Survival Probability", main="Lung Survival")
lines(survGum, type = "l", col = "red", lwd = 3) # plot Gumbel
lines(survWeib(time, fitWeib$icoef),type = "l",col = "blue",lwd = 3) # plot Weibull
lines(survfit(Surv(time, status) ~ 1, data = lung), col = "black", lwd = 1) # plot K-M
legend("topright",legend = c("Gumbel","Weibull","Kaplan-Meier"),col = c("red", "blue","black"),lwd = c(3,3,1),bty = "n")

# death rates
death_rate_Weibull <- 1-mean(survWeib(time, fitWeib$icoef))
death_rate_Gumbel <- 1-mean(survGum)

Best Answer

Putting the EdM comment into code, here are three options for estimating median survival/death rates:

  1. Use the standard parameterization for Weibull of $λ(ln2)(1/α)$ with scale parameter $λ$ and shape parameter $α$ per the answer provided in Why am I not able to correctly calculate the median survival time for the Weibull distribution?
  2. Use quantiles
  3. Use the survival fraction at time $X$

Code:

### METHOD 1: MEDIAN FORMULA ###
median_surv <- exp(fitWeib$icoef[1])*(log(2))^(1/exp(fitWeib$icoef[2]))
death_rate_Weib <- 1-median_surv/max(lung$time)

### METHOD 2: QUANTILES ###
# median survival times
median_surv_Weib <- qweibull(0.5, shape = exp(fitWeib$icoef[2]), scale = exp(fitWeib$icoef[1]))
# median death rates
death_rate_Weib <- 1 - median_surv_Weib/max(lung$time)

### METHOD 3: MIDPOINT SURVIVAL ###
# median survival percentage
surv_rate_Weib <- survWeib(max(lung$time)/2, fitWeib$icoef)
surv_rate_Gumb <- 1-evd::pgumbel(max(lung$time)/2, fitGum$estimate[1], fitGum$estimate[2])
# median death percentage
death_rate_Weib <- 1-surv_rate_Weib
death_rate_Gumb <- 1-surv_rate_Gumb
Related Question