Cox Model – Extrapolating Effect of Covariable Changes in Cox Proportional Hazards Models

cox-modelpredictionrsurvival

I have a Cox proportional hazards model in R (see made-up example below) that models the effect of some variable, say weight. From this model, I'd like to extrapolate what a change in weight from say 90 to 60 would mean to survival, taking into account the fact that for such a change occurring at say age 40, certain amount of risk has already accumulated (and assuming weight change is instantaneous).

I've attached some code which involves

  1. fitting the Cox model (using age as the time scale);
  2. extracting the predicted cumulative survival $S(t)$ using survfit for weight=90 and 60;
  3. getting the cumulative hazard $H(t) = -\log(S(t))$;
  4. getting the "instantaneous" hazard $h(t)$ via differencing $H(t)$ (plus small fudge factor to avoid zero hazard), which seems to do the job but probably a bit hacky;
  5. adding a constant to the $\log(h(t))$ for all timepoints after the change, equivalent to the $\beta$ coefficient from the Cox regression times the difference in weights (90-60=30);
  6. get the new survival functions $S^\prime(t)$ as $\exp(-{\rm cumsum}(\exp(\log(h^\prime(t)))))$.

This procedure produces reasonable results (plotted as $1 – S(t)$), but is it correct or am I just lucky?

enter image description here

library(survival)
set.seed(1)
rm(list=ls())

# Simulate some semi-realistic data
n      <- 1e3
age    <- round(runif(n, 1, 60))
weight <- round(rnorm(n, 70, 10))
height <- round(runif(n, 1.3, 1.9), 2)
sex    <- sample(c("M", "F"), length(age), replace=TRUE, prob=c(0.7, 0.3))
d.time <- ceiling(rexp(n, weight / 1e4))
cens   <- round(runif(n, 1, 60))
death  <- d.time <= cens
d.time <- pmin(d.time, cens)
d      <- data.frame(age=age, weight=weight, height=height, difftime=d.time, 
                     time=d.time + age, sex=sex, death=death)

s     <- coxph(Surv(age, time, death) ~ height + weight, data=d)
d.new <- data.frame(weight=c(60, 90), height=1.7)
sf    <- survfit(s, d.new)

# The cumulative hazard function H is -log(S(t)) where S(t) is the survivor function
# (aka cumulative survival)
S <- sf$surv[,2]

# Assume we start off with high weight
H <- -log(S)

# The hazard is the derivative (here, finite difference) of the cumulative hazard H
# But the hazard can't be zero exactly as when we take log hazard, won't make sense
h <- diff(c(0, H)) + 1e-6

# We introduce a changepoint in the hazard, but must make sure that the
# hazard does not become negative - this is naturally achieved because the
# Cox model is linear in the log-hazard. This means that the final survivor
# function will always be monotonically decreasing for any value of delta in 
# (-Inf, +Inf); delta > 0 increases hazard, delta < 0 decreases hazard
delta <- coef(s)["weight"] * (d.new$weight[1] - d.new$weight[2])
logh  <- log(h)
age   <- 40
logh[sf$time > age] <- logh[sf$time > age] + delta
h     <- exp(logh)

# Get the new cumulative hazard and new survivor functions
H <- cumsum(h)
S <- exp(-H)

# Compare original survivor function with modified one
plot(sf, lwd=5, col=1:2, conf.int=FALSE, mark=NA, fun="event",
     xlab="Age", ylab="Cumulative risk")
lines(c(0, sf$time), 1 - c(1, S), type="s", col=3, lwd=5)
abline(v=age, lty=2)
legend(x="topleft", legend=c("Weight=60", "Weight=90", "Weight decreased 90 to 60"),
       col=1:3, lwd=5)

Best Answer

I would suggest you do it non-parametrically. The procedure as you describe it imposes assumptions on the way the failure functions can relate to each other, basically because the Cox model introduces the assumption of proportional hazards. Therefore, I would argue that the red and black curves in the plot are a visualization of the model, more than they are estimates of failure functions. Not that those two things couldn't coincide, but why make this further assumption?

If you want to do something similar but non-parametrical, I would suggest using the Kaplan-Meier estimates instead. You would have to divide the weight variable into groups (assuming it's continuous), e.g. "low" and "high". You would still be able to do the counterfactual analysis that you want, simply by making a "conditional" KM plot similar to the green one above. So the green curve would be the KM of the "high" group until age $40$. At age $40$ the KM of the "low" kgs group (for $+40$ years) would continue, pasted onto the "high" ending at $40$. The KM estimate is the estimated probability of reaching age $t$, thus, for the hypothetical individual changing weight groups we can think of the probability of reaching age $40 + s$ as the probability of living from $40$ to $40 + s$ in the low weight group given survival until $40$ times the probability of living from $0$ to $40$ in the high weight group. This will exactly correspond to "pasting" the KM estimates together at age $40$. Note that the KM estimates themselves are products of conditional probabilities (conditional on survival until some time point). In symbols and if $X$ is a stochastic variable describing the time of failure of this hypothetical individual:

$$ P(X > 40 + s) = P(X > 40 + s | X > 40)P(X > 40), \ s \geq 0. $$

In conclusion, this amounts to the KM plot for "high" until age $40$ and at $40$ we use the conditional survival history of "low" (conditional on survival until $40$). To show it on a plot:

Conditional KM estimate of (highly) hypothetical subject

Some code to produce the plot, using built-in functions in R

library(ggplot2)
library(survMisc)
library(survival)


X1 <- rexp(n = 20)*50
X2 <- rexp(n = 20)*100

Sfit1 <- survfit(Surv(time = X1) ~ 1)
Sfit2 <- survfit(Surv(time = X2[X2 > 40]) ~ 1)

v  <- autoplot(Sfit1)$plot
p1 <- tail(v$data$surv[v$data$time < 40], 1)
t1 <- tail(v$data$time[v$data$time < 40], 1)


u <- autoplot(Sfit2)$plot
x <- c(t1, as.vector(u$data$time)[-1])



Sdata <- data.frame(x = x, y = p1*as.vector(u$data$surv), st = "2")

autoplot(Sfit1, title=NULL)$plot + geom_step(data=Sdata, aes(x=x, y=y, st=st))

However, one should probably still consider what the purpose of the plot really is. We're not really describing any of our subjects and it's not clear that we're describing a hypothetical (but plausible) subject either. You would want to remember that you're assuming that the hazard changes instantaneously, not only that the weight changes instantaneously. I'm no expert on human physiology, but a sudden weight loss probably entails other side-effects that are not appropriately modelled.

This is simulated data, but one should also keep in mind that the weight covariate is time-dependent, especially since we're also modelling young people and children. Treating it as time-independent is probably a bad idea. Also, the heavy people will be the ones that entered to study as adults as weight is measured at entry. The OP seems to be aware of this, though, but I thought I'd mention it anyway.

Related Question