Solved – Determine if a time-dependent Cox model is appropriate

cox-model, proportional-hazards, survival, time-varying-covariate

Before the description, here are my questions

(1) Is the set-up of my time-dependent data correct?

(2) Are the ways I run my Cox proportional hazards model, with and without a time-dependent variable, correct?

(3) Which one would be valid for my purpose (described below)?

(4) Why are the Schoenfeld residuals different between the two methods?

(5) The Cox proportional hazards model assumes that the hazard ratio stays constant through time. Does that mean that the variable's effect stays constant through time? I'm having difficulty thinking about how the effect of the variable (described below) could change through time: because the variable itself changes with time, it's not easy to point to a start time and say that the effect of the value at that time varies as time goes on.

Data

After various transformations, my data look like this:

     state     start stop   time event  statesh         nail
     (chr)     (dbl) (dbl)  (dbl) (int)   (int)         (chr)
1  California  1956  1957    NA     0      NA             0
2  California  1957  1958    NA     0      NA             0
3  California  1958  1959    NA     0      NA             0
4  California  1959  1960    NA     0      NA             0
5  California  1960  1961    NA     0      NA             0
6  California  1961  1962    NA     0      NA             0
7  California  1962  1963    NA     0      NA             0
8  California  1963  1964    NA     0      NA             0
9  California  1964  1965    NA     0      NA             0
10 California  1965  1966    NA     0      NA             0
11 California  1966  1967    NA     0      NA             0
12 California  1967  1968    NA     0      NA             0
13 California  1968  1969    NA     0      NA             0
14 California  1969  1970    NA     0      NA             0
15 California  1970  1971    NA     0  519228             0
16 California  1971  1972    NA     0  575740             1
17 California  1972  1973    NA     0  625530             1
18 California  1973  1974    8      1  644516             1
...

with state being the subject. There are 50 states, so there are 50 subjects.

Problem and variables description

I'd like to model the relationship between the independent variable statesh and the time to event.

statesh is a state's annual expense related to the event I'm interested in, so it is time-dependent (right?). This variable has a large portion of missing values.

nail is a categorical control variable that I'll include in the model later; it is a historical variable in the sense that it resulted from a legislative decision, so I don't think it is time-dependent even though it varies with time.

Models result

When I run the Cox proportional hazard model with statesh as time-dependent variable, I get

> coxph = coxph(Surv(start, stop, event) ~ statesh, data = first.data, method = "breslow")

> kable(tidyoutput(coxph))


|term    | estimate| exp.coef|  p.value|
|:-------|--------:|--------:|--------:|
|statesh |  1.5e-06| 1.000001| 0.018621|

> test = cox.zph(coxph, transform = log)
> test
           rho  chisq     p
statesh 0.0518 0.0825 0.774
> plot(test)

When I run the Cox model with statesh as a static variable (i.e., from the dataset above, I drop every row where time is NA), I get

> coxph = coxph(Surv(time, event) ~ statesh, data = first.data, method = "breslow")

> kable(tidyoutput(coxph))


|term    | estimate| exp.coef|  p.value|
|:-------|--------:|--------:|--------:|
|statesh |    4e-07|        1| 0.627078|


> test = cox.zph(coxph, transform = log)
> test
           rho  chisq     p
statesh 0.0138 0.0111 0.916

Best Answer

The vignette on time-dependent covariates and coefficients from the survival package in R is a useful introduction to these issues. It's worth close and repeated reading.

I'm assuming in this answer that there is only 1 event associated with each state. If not, you will have to look into analysis of recurrent events. I'm also assuming that the basic model makes sense; for example, that your statesh isn't simply a proxy for some other variable like population, or that if such is the case then that's what you intend.

Data setup: Your data format for time-dependent covariates is standard. Note that your nail variable does change over time, suggesting that it should be included as a time-dependent covariate if you include it in the model. That will happen automatically if you use the (start,stop,event) formulation. Note that rows having NA values for needed data will be ignored, so the first 14 lines could be omitted.
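As a rough illustration of the point about NA rows, here is a minimal sketch on simulated data. The data frame toy, its values, and the object names fit_all and fit_trim are hypothetical and not taken from the question; the sketch assumes the default na.action (na.omit) is in effect.

```r
library(survival)
set.seed(1)

# Hypothetical toy data in the same (start, stop, event) layout as the question.
# statesh is missing before 1970 and is in arbitrary made-up units.
states <- paste0("S", 1:30)
toy <- do.call(rbind, lapply(states, function(s) {
  start <- 1956:1975
  d <- data.frame(state = s, start = start, stop = start + 1,
                  statesh = ifelse(start < 1970, NA, runif(length(start), 0.1, 1)),
                  event = 0L)
  # At most one event per state, in a random year from 1970 on (NA = censored)
  ev <- sample(c(NA, 1970:1975), 1)
  if (!is.na(ev)) {
    d <- d[d$start <= ev, ]          # no rows after the event
    d$event[d$start == ev] <- 1L
  }
  d
}))

# Fit once on all rows and once after removing rows with a missing covariate
fit_all  <- coxph(Surv(start, stop, event) ~ statesh, data = toy)
fit_trim <- coxph(Surv(start, stop, event) ~ statesh,
                  data = toy[!is.na(toy$statesh), ])

# Both fits use the same rows, so the coefficients agree
all.equal(coef(fit_all), coef(fit_trim))
fit_all$n   # number of rows actually used in the fit
```

If the two fits agree, dropping the leading NA rows by hand changes nothing except making the data set smaller and easier to inspect.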

Also, be careful with how you handle the times associated with the time-dependent covariates: you don't want to be in a situation where you are looking forward into the future for a "predictor." The covariate values should be listed at times that represent their values at the times of the events, not at some time thereafter. You don't want the covariate value to be for the end of a year if the event happens previously during the year; for example, if statesh is changed in response to an event but you use an end-of-year value for it then your model would not be what you intend. That might require some restructuring of the data.
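One possible restructuring, if statesh is only known at the end of each year, is to use the previous year's value as the predictor for the current interval. The sketch below uses base R on a made-up panel; the column name statesh_lag1 and the one-year lag are assumptions, and whether that lag is appropriate depends on when statesh is actually measured relative to the events.

```r
# Hypothetical panel: one row per state-year, covariate recorded at year end
panel <- data.frame(
  state   = rep(c("A", "B"), each = 4),
  start   = rep(1970:1973, times = 2),
  statesh = c(10, 12, 15, 18, 20, 21, 25, 30)
)
panel$stop <- panel$start + 1

# Sort so that "previous row" means "previous year" within each state
panel <- panel[order(panel$state, panel$start), ]

# Previous interval's value within each state;
# the first row of each state has no history, so it becomes NA
panel$statesh_lag1 <- ave(panel$statesh, panel$state,
                          FUN = function(x) c(NA, head(x, -1)))
panel
```

The lagged column can then replace statesh in the model formula, so that each interval is predicted only from information available before it began.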

Models, meeting your purpose, residuals: The first time-dependent model seems OK, with the caution above on relative timing of events and covariate values. That would seem to be the model that best meets your interests, if you think that the value of statesh during the year is what matters in terms of affecting the probability of an event. I haven't used log transforms for time variables in a cox.zph quality-control check, but I suppose it's OK.
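To see how much the choice of time transform matters for the proportional-hazards check, one can run cox.zph with several transforms on the same fit. This is a sketch on the veteran data set that ships with the survival package, not on the question's data; the model and object names are chosen only for illustration.

```r
library(survival)

# Illustrative model on a built-in data set (not the question's data)
fit <- coxph(Surv(time, status) ~ karno + age, data = veteran)

zp_km  <- cox.zph(fit)                          # default: Kaplan-Meier transform
zp_log <- cox.zph(fit, transform = log)         # a function of time, as in the question
zp_id  <- cox.zph(fit, transform = "identity")  # untransformed time

zp_km
zp_log
zp_id
```

If the conclusions differ sharply between transforms, that is worth investigating with the residual plots before trusting any single p-value.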

In the second model, if the data are limited to one row per state then you are limiting analysis to the covariate values at the times of events; I trust that if any states never had events then you have included covariates and censoring indicators for them at the last available time point. It's implicitly making the assumption that the statesh variable was constant, at that value, from time 0. If you know that the covariate values change with time this would not seem to be appropriate.

If you have 2 different models then you would expect different values for residuals.

Interpretation of time-dependent covariates and their coefficients: The idea is that only present covariate values matter and that the relative hazard, among cases, associated with the covariates is constant over time. In that sense the effect of each predictor variable is constant over time; it's the value of the variable that changes over time, with the relative hazard for a case changing as its value changes. Only the current values of the covariates matter at any time, not their initial values if they are time-dependent.

The analysis proceeds from event time to event time, and for each event time it considers the covariate values at that time for the case with the event against the covariate values at that time for all the other cases that were still at risk. So the comparisons are for differences among cases at a given time, not for changes in values over time. Thus there is no inherent memory for covariates. With the calls that you use, the algorithm doesn't even attempt to remember which cases are which, only whether they are still at risk at a given time.
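The event-time-by-event-time comparison described above corresponds to the Cox partial likelihood, written here for a single covariate and no tied event times. The notation is introduced only for this illustration: $t_i$ is the event time of case $i$, $x_j(t)$ is the covariate value of case $j$ at time $t$, $R(t_i)$ is the set of cases still at risk at $t_i$, and $\beta$ is the single, time-constant coefficient.

$$
L(\beta) = \prod_{i:\ \text{event at } t_i} \frac{\exp\{\beta\, x_i(t_i)\}}{\sum_{j \in R(t_i)} \exp\{\beta\, x_j(t_i)\}}
$$

Each factor involves only covariate values at the current event time $t_i$, which is why earlier values of a time-dependent covariate play no role unless they are added explicitly as separate covariates.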

You may incorporate lagged values of covariates if you want to include some form of memory. This document provides an example.

Also, be sure to distinguish this situation with time-dependent covariates from that with time-dependent coefficients. In the latter case the effects of the variables themselves do change. Time-dependent coefficients may be required if there are non-proportional hazards in the standard Cox regression. Section 4 of the vignette goes into some detail, and shows how time-transform (tt) functions can be used to simplify handling either of covariates or of coefficients with known or assumed forms of time dependence.
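For contrast with the time-dependent covariate case, here is a sketch of a time-dependent coefficient using a tt() term, in the style of the examples in the survival vignette. It uses the veteran data set from the survival package rather than the question's data, and the functional form x * log(t + 20) is an assumed form of time dependence chosen for illustration.

```r
library(survival)

# karno enters twice: once with a constant coefficient, and once multiplied
# by log(t + 20), so its overall effect is allowed to change with time
fit_tt <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
                data = veteran,
                tt = function(x, t, ...) x * log(t + 20))

fit_tt
```

A coefficient on the tt(karno) term that is clearly different from zero would suggest that the effect of karno itself varies over time, which is a different situation from a covariate whose value varies.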
