If you know the parametric distribution that your data follows then using a maximum likelihood approach and the distribution makes sense. The real advantage of Cox Proportional Hazards regression is that you can still fit survival models without knowing (or assuming) the distribution. You give an example using the normal distribution, but most survival times (and other types of data that Cox PH regression is used for) do not come close to following a normal distribution. Some may follow a log-normal, or a Weibull, or other parametric distribution, and if you are willing to make that assumption then the maximum likelihood parametric approach is great. But in many real world cases we do not know what the appropriate distribution is (or even a close enough approximation). With censoring and covariates we cannot do a simple histogram and say "that looks like a ... distribution to me". So it is very useful to have a technique that works well without needing a specific distribution.
Why use the hazard instead of the distribution function? Consider the statement: "People in group A are twice as likely to die at age 80 as people in group B." That could be true because people in group B tend to live longer than those in group A. Or it could be because people in group B tend to live shorter lives and most of them are dead long before age 80, giving a very small probability of dying at 80, while enough people in group A survive to 80 that a fair number of them die at that age, giving a much higher probability of death then. So the same statement could mean that being in group A is better or worse than being in group B. What makes more sense is to ask: of those people (in each group) who lived to 80, what proportion will die before they turn 81? That is the hazard (and the hazard is a function of the distribution function/survival function/etc.). The hazard is easier to work with in the semi-parametric model and can still give you information about the distribution.
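As a small numerical sketch of this ambiguity (the normal distributions and ages below are made up purely for illustration): suppose lifetimes in group A are roughly Normal(78, 5) and in group B roughly Normal(70, 5). Then group A has the higher probability of dying in the year after the 80th birthday, yet group B has the higher hazard at 80:

```r
# hypothetical lifetime distributions (illustration only)
mA <- 78; mB <- 70; s <- 5

# P(die in [80, 81)) -- the "probability of dying at age 80"
pA <- pnorm(81, mA, s) - pnorm(80, mA, s)
pB <- pnorm(81, mB, s) - pnorm(80, mB, s)

# discrete hazard at 80: P(die in [80, 81) | alive at 80)
hA <- pA / (1 - pnorm(80, mA, s))
hB <- pB / (1 - pnorm(80, mB, s))

pA > pB  # TRUE: group A has the higher probability of death at 80...
hA < hB  # TRUE: ...but the lower hazard at 80
```

So "twice as likely to die at age 80" is ambiguous, while the hazard conditions on having survived to 80 and compares the groups on equal footing.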
It is not clear to me how you generate your event times (which, in your case, might be $<0$) and event indicators (note also that runif(1) draws a single uniform value that R then recycles across all subjects):
time = rnorm(n,10,2)
S_prob = S(time)
event = ifelse(runif(1)>S_prob,1,0)
So here is a generic method, followed by some R code.
Generating survival times to simulate Cox proportional hazards models
To generate event times from the proportional hazards model, we can use the inverse probability method (Bender et al., 2005): if $V$ is uniform on $(0, 1)$ and if $S(\cdot \,|\, \mathbf{x})$ is the conditional survival function derived from the proportional hazards model, i.e.
$$
S(t \,|\, \mathbf{x}) = \exp \left( -H_0(t) \exp(\mathbf{x}^\prime \mathbf{\beta}) \vphantom{\Big(} \right)
$$
then it is a fact that the random variable
$$
T = S^{-1}(V \,|\, \mathbf{x}) = H_0^{-1} \left( - \frac{\log(V)}{\exp(\mathbf{x}^\prime \mathbf{\beta})} \right)
$$
has survival function $S(\cdot \,|\, \mathbf{x})$. This result is known as "the inverse probability integral transformation". Therefore, to generate a survival time $T \sim S(\cdot \,|\, \mathbf{x})$ given the covariate vector, it suffices to draw $v$ from $V \sim \mathrm{U}(0, 1)$ and to make the inverse transformation $t = S^{-1}(v \,|\, \mathbf{x})$.
Example [Weibull baseline hazard]
Let $h_0(t) = \lambda \rho t^{\rho - 1}$ with shape $\rho > 0$ and scale $\lambda > 0$. Then $H_0(t) = \lambda t^\rho$ and $H^{-1}_0(t) = (\frac{t}{\lambda})^{\frac{1}{\rho}}$. Following the inverse probability method, a realisation of $T \sim S(\cdot \,|\, \mathbf{x})$ is obtained by computing
$$
t = \left( - \frac{\log(v)}{\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})} \right)^{\frac{1}{\rho}}
$$
with $v$ a uniform variate on $(0, 1)$. Using results on transformations of random variables, one may notice that $T$ has a conditional Weibull distribution (given $\mathbf{x}$) with shape $\rho$ and scale $\lambda \exp(\mathbf{x}^\prime \mathbf{\beta})$.
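As a quick sanity check of that last claim, we can compare draws from the inverse probability method against direct draws from rweibull. Note that R parameterises the Weibull with a scale $b$ such that $S(t) = \exp(-(t/b)^{\text{shape}})$, so here $b = (\lambda \exp(\mathbf{x}^\prime \mathbf{\beta}))^{-1/\rho}$ (the parameter values below are arbitrary):

```r
set.seed(42)
lambda <- 0.01; rho <- 1.5; beta <- -0.6; x <- 1

# inverse probability method
v <- runif(1e5)
t1 <- (-log(v) / (lambda * exp(x * beta)))^(1 / rho)

# direct draw: Weibull with shape rho and R-style scale (lambda * e^{x'beta})^(-1/rho)
t2 <- rweibull(1e5, shape = rho, scale = (lambda * exp(x * beta))^(-1 / rho))

# the two samples should agree in distribution
ks.test(t1, t2)  # p-value should typically be large
```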
R code
The following R function generates a data set with a single binary covariate $x$ (e.g. a treatment indicator). The baseline hazard has a Weibull form. Censoring times are randomly drawn from an exponential distribution.
# baseline hazard: Weibull
# N = sample size
# lambda = scale parameter in h0()
# rho = shape parameter in h0()
# beta = fixed effect parameter
# rateC = rate parameter of the exponential distribution of C
simulWeib <- function(N, lambda, rho, beta, rateC)
{
  # covariate --> N Bernoulli trials
  x <- sample(x=c(0, 1), size=N, replace=TRUE, prob=c(0.5, 0.5))

  # Weibull latent event times
  v <- runif(n=N)
  Tlat <- (- log(v) / (lambda * exp(x * beta)))^(1 / rho)

  # censoring times
  C <- rexp(n=N, rate=rateC)

  # follow-up times and event indicators
  time <- pmin(Tlat, C)
  status <- as.numeric(Tlat <= C)

  # data set
  data.frame(id=1:N,
             time=time,
             status=status,
             x=x)
}
Test
Here is a quick simulation with $\beta = -0.6$ (coxph requires the survival package):
library(survival)

set.seed(1234)
betaHat <- rep(NA, 1e3)
for(k in 1:1e3)
{
  dat <- simulWeib(N=100, lambda=0.01, rho=1, beta=-0.6, rateC=0.001)
  fit <- coxph(Surv(time, status) ~ x, data=dat)
  betaHat[k] <- fit$coef
}
> mean(betaHat)
[1] -0.6085473
Best Answer
The estimation of life expectancy in the presence of censored data necessarily requires assumptions about the unobserved part of the survival function. A parametric distribution can be used to extrapolate from observed to expected survival, but it is not easy to capture the shape of the unobserved survival function. A possible approach is to use relative survival (see for instance Andersson et al. 2012).
If you wish to avoid extrapolating beyond the data, you can instead evaluate survival percentiles. With a sufficient number of events you can estimate the median survival time; when fewer than 50% of subjects experience the event during follow-up, lower percentiles (e.g. the 25th) can still be estimated.
Survival percentiles can be calculated from the Kaplan-Meier estimator, which summarizes the observed survival. If you are interested in adjusted survival percentiles, you may take a look at Laplace regression (see for instance Orsini et al. 2012).
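As a minimal sketch of Kaplan-Meier percentiles in R (the data here are simulated, just to show the calls; quantile for survfit objects is provided by the survival package):

```r
library(survival)

# simulated example data: exponential event times with random censoring
set.seed(1)
n <- 200
T <- rexp(n, rate = 0.1)      # latent event times
C <- rexp(n, rate = 0.05)     # censoring times
time <- pmin(T, C)
status <- as.numeric(T <= C)

# Kaplan-Meier estimate of the survival curve
fit <- survfit(Surv(time, status) ~ 1)

# survival percentiles: 25th percentile (25% have died) and median
quantile(fit, probs = c(0.25, 0.5))
```

If the curve never drops below 50%, the median is not estimable (it comes back NA), but the lower percentiles may still be.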