Cox Model – Understanding the Cox Proportional Hazards Model


I'm trying to fit a Cox model, but there are some problems. I have the following variables in the model.

Group: 1, 2, …, 9
Sex: 1 female and 0 male
Weight
Age

The first thing I did was split Age and Weight into 4 groups each and check whether the proportional hazards assumption holds for each variable, by plotting $-\log(-\log S(t))$ against $t$.
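For reference, a check of this kind can be sketched as follows. The data in the question are not shown, so this uses the `veteran` dataset shipped with the survival package as a stand-in; the variable names and the quartile split are assumptions for illustration. Note that `fun = "cloglog"` draws $\log(-\log S(t))$ against time on a log scale, which is the mirror image of the plot described above; in both versions, roughly parallel curves are what proportional hazards would look like.

```r
library(survival)

# Stand-in data (assumption): the veteran lung cancer trial from survival
vet <- veteran

# Split a continuous covariate into 4 groups at its quartiles
vet$age_grp <- cut(vet$age,
                   breaks = quantile(vet$age, probs = seq(0, 1, 0.25)),
                   include.lowest = TRUE)

# Kaplan-Meier curves, one per group
km_age <- survfit(Surv(time, status) ~ age_grp, data = vet)

# Complementary log-log plot: log(-log S(t)) against log(t)
plot(km_age, fun = "cloglog", col = 1:4,
     xlab = "Time (log scale)", ylab = "log(-log S(t))")
legend("topleft", legend = levels(vet$age_grp), col = 1:4, lty = 1)
```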

The plot below is for Group.

For all four variables the proportional hazards assumption appears to be violated (the curves cross). I then checked it with a hypothesis test, after fitting the model:

model<-coxph(Surv(Time,Event)~ Group + Sex + Weight + Age,data= dataset)
summary(model)
             coef  exp(coef)   se(coef)       z  Pr(>|z|)
G2      0.1705602  1.1859691  0.1956226   0.872  0.383272
G3     -1.0036611  0.3665351  0.2386762  -4.205  2.61e-05 ***
G4     -0.8381683  0.4325020  0.2399613  -3.493  0.000478 ***
G5     -0.4544249  0.6348130  0.2092611  -2.172  0.029888 *
G6     -0.9123168  0.4015927  0.3471589  -2.628  0.008590 **
G7     -0.9977854  0.3686950  0.2413699  -4.134  3.57e-05 ***
G8     -1.7056585  0.1816527  0.3097035  -5.507  3.64e-08 ***
G9     -1.1614730  0.3130248  0.2488757  -4.667  3.06e-06 ***
Sex    -0.0307328  0.9697347  0.1331374  -0.231  0.817443
Weight  0.0004572  1.0004573  0.0004121   1.109  0.267330
Age     0.0044168  1.0044266  0.0036702   1.203  0.228815

According to the model summary, Sex, Weight, and Age are not significant, so the model would be left with Group as its only variable.

So I ran

cox.zph(model,transform="rank",global=TRUE)

           rho    chisq         p
G2     -0.1142   4.2426  0.039423
G3     -0.1732  10.6197  0.001119
G4     -0.0989   3.2302  0.072293
G5     -0.1588   8.7741  0.003055
G6     -0.1284   5.4636  0.019416
G7     -0.0508   0.9136  0.339165
G8      0.0984   3.3136  0.068709
G9     -0.1062   4.1598  0.041395
Sex     0.0085   0.0242  0.876276
Weight  0.1121   5.1191  0.023664
Age    -0.0109   0.0372  0.846986
GLOBAL      NA  36.2568  0.000153

I don't understand this output well. Does Group 7 alone have proportional hazards? How can Sex and Age have proportional hazards if the curves in their plots cross?

If one level of a categorical variable does not satisfy the proportional hazards assumption, then the categorical variable as a whole does not meet the assumption, right?

I ran several checks of proportionality, with graphs and with tests using time-dependent covariates, and the assumption is indeed not met. I therefore fitted a Cox model stratified by Group; the output is below.

             coef  exp(coef)   se(coef)       z  Pr(>|z|)
Sex    -0.0295480  0.9708843  0.1331459  -0.222     0.824
Weight  0.0004545  1.0004546  0.0004111   1.105     0.269
Age     0.0043919  1.0044016  0.0036679   1.197     0.231

        exp(coef)  exp(-coef)  lower .95  upper .95
Sex        0.9709      1.0300     0.7479      1.260
Weight     1.0005      0.9995     0.9996      1.001
Age        1.0044      0.9956     0.9972      1.012

Concordance= 0.532  (se = 0.045 )
Rsquare= 0.001   (max possible= 0.719 )
Likelihood ratio test= 3.11  on 3 df,   p=0.3745
Wald test            = 3.14  on 3 df,   p=0.3712
Score (logrank) test = 3.14  on 3 df,   p=0.3712
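
The call that produced this output is not shown. A stratified fit of this kind is usually written with `strata()`, as in the minimal sketch below. It uses the `veteran` data from the survival package as a stand-in, with `celltype` playing the role of Group (an assumption for illustration only).

```r
library(survival)

# Each celltype gets its own baseline hazard;
# no coefficient is estimated for the stratifying variable
fit_strat <- coxph(Surv(time, status) ~ age + karno + strata(celltype),
                   data = veteran)
summary(fit_strat)

# Proportionality is then checked only for the remaining covariates
cox.zph(fit_strat)
```

With the variables in this question, the analogous formula would presumably be `Surv(Time, Event) ~ Sex + Weight + Age + strata(Group)`.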

Here what I see is:

The variables are not statistically significant
The hazard ratios are very close to 1 for Weight and Age, and slightly below 1 for Sex, so these variables appear to have no effect on survival time.

So I have no reason to keep them in the model, which would leave me with only the variable group that does not meet the proportionality hypothesis.

I am beginning to think that a parametric model is the best option in this case.

Best Answer

One issue here is your choice of reference level for the Group variable, G1. The regression coefficients for other Groups are with respect to that reference level, and as I understand it the same is true for the "significant" non-proportionalities seen for the other Groups. Note that this type of summary does not provide a test for the significance of the Group variable as a whole. Had you chosen a different reference level, much of the difficulty with non-proportionality might have been isolated to just one or two Groups. It's important to think about the subject-matter content of your data; there might be good reasons why some groups have different hazard time courses than others.
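
Both points can be tried out with a short sketch. It uses the `veteran` data from the survival package as a stand-in (an assumption; the question's data are not available), with the 4-level factor `celltype` in place of Group. A likelihood ratio test between nested models gives a test for the factor as a whole, and `relevel()` changes the reference level, which changes the individual contrasts but not the overall fit.

```r
library(survival)

vet <- veteran  # stand-in data; celltype is a 4-level factor

fit_full <- coxph(Surv(time, status) ~ celltype + karno, data = vet)
fit_red  <- coxph(Surv(time, status) ~ karno, data = vet)

# Likelihood ratio test for celltype as a whole (3 df)
anova(fit_red, fit_full)

# Same model with a different reference level for the factor
vet$celltype2 <- relevel(vet$celltype, ref = "large")
fit_relev <- coxph(Surv(time, status) ~ celltype2 + karno, data = vet)

# Individual coefficients differ, the maximized log-likelihood does not
coef(fit_full)
coef(fit_relev)

# Proportional hazards test; the layout of the output depends on the
# version of survival installed
cox.zph(fit_full)
```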

Also, be careful about how you interpret the p-values for the cox.zph tests. A low p-value for a coefficient is evidence that the proportional hazards assumption doesn't hold for its associated predictor, but a "non-significant" p-value is not proof that the proportional hazards assumption is met. As with any statistical test, a non-significant p-value might simply mean too few cases or too much variability to argue against the null hypothesis of proportional hazards. It's hard to tell from your graph, but that might explain why crossing plots have p-values that do not rule out the PH assumption.

Related Solutions

Solved – Handling borderline cases of the proportional hazards assumption

There are several options. As a next step, I'd recommend examining the impact of the assumption on your hazard ratio estimates, rather than relying on a test statistic (or even on the log-log graphs, at least from the perspective of determining impact). The two I'd initially suggest:

explicitly add in a time * covariate interaction to examine how the hazard ratio for your covariate changes over time.
add in a "heaviside" interaction term that explicitly models a hazard ratio for an "early" period and a "late" period. This is probably only sensible when you have an a priori cut-off for defining early/late (e.g. we've used it when modelling rescreening times in breast cancer, where there are defined screening provider targets for a 27 month rescreen interval: so we defined the break point for the heaviside function at 27 months.)
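
Both options can be sketched with the survival package. This is an illustration under assumptions, not the analysis from the original question: it uses the `veteran` data, treats `karno` as the covariate of interest, and picks $\log(t + 20)$ as the time transform and 90 days as the early/late cut-off purely for demonstration.

```r
library(survival)

# Option 1: time * covariate interaction via a time-transform term.
# The coefficient on tt(karno) describes how the log hazard ratio
# for karno changes with log(t + 20).
fit_tt <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
                data = veteran,
                tt = function(x, t, ...) x * log(t + 20))
coef(fit_tt)

# Option 2: step ("heaviside") interaction.
# Split each subject's follow-up at the cut-off, then estimate a
# separate karno coefficient for the early and the late period.
vet2 <- survSplit(Surv(time, status) ~ ., data = veteran,
                  cut = 90, episode = "tgroup", id = "id")
fit_step <- coxph(Surv(tstart, time, status) ~ trt + prior +
                    karno:strata(tgroup),
                  data = vet2)
coef(fit_step)
```

In `fit_step`, comparing the two `karno` coefficients shows how much the hazard ratio differs between the early and late periods, which is the kind of impact assessment suggested above.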

Obviously stratification by the covariate in question isn't an option... since there's only one covariate and it's the one you're principally interested in!

Some options are covered in Survival Analysis: A Self-Learning Text. See also the course notes from the National University of Singapore, a presentation that is on the mathematical side.

Cox Model – Extrapolating Effect of Covariable Changes in Cox Proportional Hazards Models

I would suggest you do it non-parametrically. The procedure as you describe it imposes assumptions on the way the failure functions can relate to each other, basically because the Cox model introduces the assumption of proportional hazards. Therefore, I would argue that the red and black curves in the plot are a visualization of the model, more than they are estimates of failure functions. Not that those two things couldn't coincide, but why make this further assumption?

If you want to do something similar but non-parametric, I would suggest using the Kaplan-Meier estimates instead. You would have to divide the weight variable into groups (assuming it's continuous), e.g. "low" and "high". You would still be able to do the counterfactual analysis that you want, simply by making a "conditional" KM plot similar to the green one above. The green curve would be the KM of the "high" group until age $40$. At age $40$, the KM of the "low" group (for ages above $40$) would take over, pasted onto the end of the "high" curve at $40$. The KM estimate is the estimated probability of reaching age $t$. Thus, for the hypothetical individual changing weight groups, we can think of the probability of reaching age $40 + s$ as the probability of living from $40$ to $40 + s$ in the low weight group, given survival until $40$, times the probability of living from $0$ to $40$ in the high weight group. This corresponds exactly to "pasting" the KM estimates together at age $40$. Note that the KM estimates themselves are products of conditional probabilities (conditional on survival until some time point). In symbols, if $X$ is a stochastic variable describing the time of failure of this hypothetical individual:

$$ P(X > 40 + s) = P(X > 40 + s \mid X > 40)\,P(X > 40), \quad s \geq 0. $$

In conclusion, this amounts to the KM plot for "high" until age $40$ and at $40$ we use the conditional survival history of "low" (conditional on survival until $40$). To show it on a plot:

Conditional KM estimate of (highly) hypothetical subject

Some code to produce the plot, using built-in functions in R

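library(ggplot2)
library(survMisc)
library(survival)

set.seed(1)  # added so the simulated example is reproducible

# Simulated failure ages: X1 plays the "high" weight group, X2 the "low" group
X1 <- rexp(n = 20) * 50
X2 <- rexp(n = 20) * 100

# KM for "high"; KM for "low" restricted to subjects surviving past age 40
Sfit1 <- survfit(Surv(time = X1) ~ 1)
Sfit2 <- survfit(Surv(time = X2[X2 > 40]) ~ 1)

# Last time point and survival value of the "high" curve before age 40
v  <- autoplot(Sfit1)$plot
p1 <- tail(v$data$surv[v$data$time < 40], 1)
t1 <- tail(v$data$time[v$data$time < 40], 1)

# Start the conditional "low" curve where the "high" curve leaves off
u <- autoplot(Sfit2)$plot
x <- c(t1, as.vector(u$data$time)[-1])

# Scale the conditional curve by the "high" survival at the switch
Sdata <- data.frame(x = x, y = p1 * as.vector(u$data$surv), st = "2")

autoplot(Sfit1, title = NULL)$plot + geom_step(data = Sdata, aes(x = x, y = y, st = st))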
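
The same pasting can also be sketched without survMisc, which may be convenient if its plotting interface differs between versions. This is an alternative sketch under assumptions, not the code that produced the plot above: the data are again simulated, the switch age of 40 is taken from the text, and the names `S_high`, `S_low` and `pasted` are introduced here for illustration. The pasted curve is $\hat S_{\text{high}}(t)$ up to age $40$ and $\hat S_{\text{high}}(40)\,\hat S_{\text{low}}(t)/\hat S_{\text{low}}(40)$ afterwards.

```r
library(survival)

set.seed(1)
high <- rexp(200) * 50    # simulated failure ages, "high" weight group
low  <- rexp(200) * 100   # simulated failure ages, "low" weight group

# No censoring in the simulation, so every time is an event
fit_high <- survfit(Surv(high) ~ 1)
fit_low  <- survfit(Surv(low) ~ 1)

# Right-continuous step functions for the two KM curves
S_high <- stepfun(fit_high$time, c(1, fit_high$surv))
S_low  <- stepfun(fit_low$time,  c(1, fit_low$surv))

switch_age <- 40

# "High" curve up to the switch, then the conditional "low" curve
# scaled by the "high" survival at the switch
pasted <- function(t) {
  ifelse(t <= switch_age,
         S_high(t),
         S_high(switch_age) * S_low(t) / S_low(switch_age))
}

ages <- seq(0, 300, by = 0.5)
plot(ages, S_high(ages), type = "s", xlab = "Age", ylab = "Survival")
lines(ages, pasted(ages), type = "s", col = "green")
```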

However, one should probably still consider what the purpose of the plot really is. We're not really describing any of our subjects and it's not clear that we're describing a hypothetical (but plausible) subject either. You would want to remember that you're assuming that the hazard changes instantaneously, not only that the weight changes instantaneously. I'm no expert on human physiology, but a sudden weight loss probably entails other side-effects that are not appropriately modelled.

This is simulated data, but one should also keep in mind that the weight covariate is time-dependent, especially since we're also modelling young people and children. Treating it as time-independent is probably a bad idea. Also, the heavy people will be the ones who entered the study as adults, since weight is measured at entry. The OP seems to be aware of this, but I thought I'd mention it anyway.
