Cox Proportional Hazard Model – Insights on Time Varying Coefficients in Cox Proportional Hazard Model

cox-model, proportional-hazards, r, survival

I am trying to fit a coxph model in R. The study can be described as follows: I have a very large dataset, in counting-process form, recording whether or not someone responded to a survey. The time variable is the number of consecutive months in which a response was received. The model predicts non-response, i.e. the event occurs when someone does not respond. There is more than one record per id (itemcode in my dataset). There are no continuous covariates as of now. The model includes seasonal effects: I would like to know how each season raises or lowers the hazard of non-response, relative to fall. I have already stratified the model. The results are below:

Call:
coxph(formula = Surv(start, cum.goodp, dlq.next) ~ winter + spring + 
summer + strata(sector) + cluster(itemcode), data = nr.sample.split)

n= 651033, number of events= 42508 

       coef exp(coef) se(coef) robust se      z Pr(>|z|)    
winter  0.26850   1.30800  0.01307   0.01283  20.92   <2e-16 ***
spring -0.64040   0.52708  0.01385   0.01342 -47.72   <2e-16 ***
summer  0.29188   1.33894  0.01414   0.01284  22.73   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

   exp(coef) exp(-coef) lower .95 upper .95
winter    1.3080     0.7645    1.2755    1.3413
spring    0.5271     1.8972    0.5134    0.5411
summer    1.3389     0.7469    1.3057    1.3731

Concordance= 0.598  (se = 0.004 )
Rsquare= 0.009   (max possible= 0.636 )
Likelihood ratio test= 5864  on 3 df,   p=0
Wald test            = 4783  on 3 df,   p=0
Score (logrank) test = 5634  on 3 df,   p=0,   Robust = 5015  p=0

 (Note: the likelihood ratio and score tests assume independence of
 observations within a cluster, the Wald and robust score tests do not).

I then ran cox.zph to test the PH assumption; the results are below:

           rho   chisq       p
winter -0.1283  691.45 0.00000
spring -0.1151  569.35 0.00000
summer -0.0163    9.36 0.00221
GLOBAL      NA 1096.18 0.00000 

Clearly, the PH assumption is not valid for any of the coefficients. Below is a plot of one, summer:

[Plot of Beta(t) for coefficient "summer" omitted]

My question is: since the seasonal dummy variables are static by nature, and their coefficients clearly vary with the time variable, how much does it matter? I get that statistically, it means something, but does the violation of PH assumption invalidate the (intuitively appealing) result that non-response is more likely to happen in the summer and winter? If so, is there a way to handle this so that the PH assumption is not violated? I know about using the tt transform, but I can't seem to figure out the exact form for the function. Any advice, ideas or references would be greatly appreciated.

Best Answer

Since the seasonal dummy variables are static by nature, and their coefficients clearly vary with the time variable, how much does it matter?

The value that you get is a form of average over time. Unfortunately, since time-to-event analyses naturally have more cases at risk early on, you can't simply say that the effect is balanced throughout time. In my experience the initial period has a much heavier impact on the estimate than the later periods.

I get that statistically, it means something, but does the violation of PH assumption invalidate the (intuitively appealing) result that non-response is more likely to happen in the summer and winter?

Again, it probably doesn't, but you can't be sure until you've checked. Something is definitely happening over time, and you should at least look at the residual plots (plot(cox.zph(...))). It isn't entirely surprising that you have a problem with PH, since the seasons are themselves functions of the time variable, and there will be situations where, for example, early summer and late spring behave similarly.
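As a sketch of that check (using the bundled lung data, since your nr.sample.split data isn't available here), the idea is to fit the model, run cox.zph, and plot the scaled Schoenfeld residuals per covariate; a flat smooth around the fitted coefficient suggests PH is plausible for that term:

```r
library(survival)

# Fit a simple Cox model on the example lung data
fit <- coxph(Surv(time, status) ~ ph.ecog + age, data = lung)

# Test the PH assumption via scaled Schoenfeld residuals
zph <- cox.zph(fit)
print(zph)

# One panel per covariate: the smooth estimates beta(t);
# systematic trends indicate a time-varying coefficient
plot(zph)
```

With your model you would look for the shape of beta(t) for each seasonal dummy, which also hints at what functional form a tt() transform would need.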

If so, is there a way to handle this so that the PH assumption is not violated? I know about using the tt transform, but I can't seem to figure out the exact form for the function.

The tt transform is tricky to use with big data. It explodes the matrix and can get a little messy, e.g. if you modify the lung example in the survival package:

library(survival)
coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung,
      tt=function(x,t,...) {
        print(length(x))
        pspline(x + t/365.25)
      })

It prints 15809, while there are only 228 rows in the original dataset. The principle of tt() is that it feeds each variable into the transformation function, where you are free to use time any way you wish. Note that you can also supply a different transformation function for each variable:

library(survival)
coxph(Surv(time, status) ~ tt(ph.ecog) + tt(age), data=lung,
      tt=list(
        function(x,t,...) {
          cbind(x, x + t/365.25, (x + t/365.25)^2)
        },
        function(x,t,...) {
          pspline(x + t/365.25)
        }),
      x=T,
      y=T) -> fit
head(fit$x)

Gives:

    tt(ph.ecog)x tt(ph.ecog) tt(ph.ecog) tt(age)1 tt(age)2 tt(age)3 tt(age)4
6              1         3.4        11.7        0        0        0    0.000
3              0         2.4         5.8        0        0        0    0.020
38             1         3.4        11.7        0        0        0    0.000
5              0         2.4         5.8        0        0        0    0.000
6.1            1         3.2        10.4        0        0        0    0.000
3.1            0         2.2         5.0        0        0        0    0.026
    tt(age)5 tt(age)6 tt(age)7 tt(age)8 tt(age)9 tt(age)10 tt(age)11 tt(age)12
6       0.00  0.00000    0.000   0.0052    0.359      0.58     0.053         0
3       0.48  0.48232    0.021   0.0000    0.000      0.00     0.000         0
38      0.00  0.00087    0.266   0.6393    0.094      0.00     0.000         0
5       0.03  0.51933    0.437   0.0136    0.000      0.00     0.000         0
6.1     0.00  0.00000    0.000   0.0078    0.388      0.56     0.044         0
3.1     0.50  0.45457    0.016   0.0000    0.000      0.00     0.000         0

I therefore try to avoid this solution and instead use the time-split approach that I wrote about here and in my answer to my own question.
