The major advance of joint models over the time-dependent Cox model is that they allow one to account for measurement error in the time-dependent variable (the longitudinal variable in this case). A Cox model with time-dependent covariates assumes the covariates are measured without error.
Some references:
Tsiatis, A. A. and M. Davidian (2004). Joint modeling of longitudinal and time-to-event data: An overview. Statistica Sinica 14(3), 809-834.
Rizopoulos, D. (2012). Joint Models for Longitudinal and Time-to-Event Data: With Applications in R. Chapman and Hall/CRC.
Henderson, R., P. Diggle, and A. Dobson (2000). Joint modelling of longitudinal measurements and event time data. Biostatistics 1(4), 465-480.
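As a sketch of what fitting such a joint model looks like in practice, here is the general pattern with Rizopoulos's JM package; the data frames and column names (long_data, surv_data, y, time, id, Time, event) are hypothetical placeholders, not from the question:

```r
library(nlme)
library(survival)
library(JM)

# Sketch, assuming two hypothetical data frames:
#   long_data: repeated marker measurements (columns y, time, id)
#   surv_data: one row per subject (columns Time, event)

# Longitudinal submodel: linear mixed model for the error-prone marker
lmeFit <- lme(y ~ time, random = ~ time | id, data = long_data)

# Survival submodel (x = TRUE keeps the design matrix, which JM requires)
coxFit <- coxph(Surv(Time, event) ~ 1, data = surv_data, x = TRUE)

# Joint model linking the smoothed (error-free) marker trajectory to the hazard
jointFit <- jointModel(lmeFit, coxFit, timeVar = "time")
summary(jointFit)
```

The point of the joint fit is that the hazard depends on the underlying trajectory estimated by the mixed model, rather than on the raw, error-contaminated measurements.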
The vignette on time-dependent covariates and coefficients from the survival
package in R is a useful introduction to these issues. It's worth close and repeated reading.
I'm assuming in this answer that there is only 1 event associated with each state. If not, you will have to look into analysis of recurrent events. I'm also assuming that the basic model makes sense; for example, that your statesh variable isn't simply a proxy for some other variable like population, or, if it is, that that's what you intend.
Data setup: Your data format for time-dependent covariates is standard. Note that your nail variable does change over time, suggesting that it should be included as a time-dependent covariate if you include it in the model. That will happen automatically if you use the (start, stop, event) formulation. Note that rows having NA values for needed data will be ignored, so the first 14 lines could be omitted.
Also, be careful with how you handle the times associated with the time-dependent covariates: you don't want to be in a situation where you are looking forward into the future for a "predictor." The covariate values should be listed at times that represent their values at the times of the events, not at some time thereafter. You don't want the covariate value to be for the end of a year if the event happens earlier during the year; for example, if statesh is changed in response to an event but you use an end-of-year value for it, then your model would not be what you intend. That might require some restructuring of the data.
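As a minimal sketch, the counting-process formulation would look like this; the data frame df is assumed to match the question's long-format setup, with covariate values holding over each [start, stop) interval:

```r
library(survival)

# Hypothetical long-format data: one row per state per interval, with
# statesh and nail holding their values during [start, stop).
fit <- coxph(Surv(start, stop, event) ~ statesh + nail,
             data = df, id = state)
summary(fit)

# Proportional-hazards check on the fitted model
cox.zph(fit)
```

Because each row carries the covariate values for its own interval, the time-dependent behaviour of statesh and nail is handled automatically.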
Models, meeting your purpose, residuals: The first time-dependent model seems OK, with the caution above on the relative timing of events and covariate values. That would seem to be the model that best meets your interests, if you think that the value of statesh during the year is what matters in affecting the probability of an event. I haven't used log transforms for time variables in a cox.zph quality-control check, but I suppose it's OK.
In the second model, if the data are limited to one row per state then you are limiting the analysis to the covariate values at the times of events; I trust that if any states never had events, then you have included covariates and censoring indicators for them at the last available time point. This implicitly assumes that the statesh variable was constant, at that value, from time 0. If you know that the covariate values change with time, this would not seem appropriate.
If you have 2 different models then you would expect different values for residuals.
Interpretation of time-dependent covariates and their coefficients: The idea is that only present covariate values matter and that the relative hazard, among cases, associated with the covariate variables is constant over time. In that sense the effect of each predictor variable is constant over time; it's the value of the variable that changes over time, with the relative hazard for a case changing as its value changes. Only the current values of the covariates matter at any time, not their initial values if they are time-dependent.
The analysis proceeds from event time to event time, and for each event time it compares the covariate values at that time for the case with the event against the covariate values at that time for all the other cases still at risk. So the comparisons are of differences among cases at a given time, not of changes in values over time. Thus there is no inherent memory for covariates: in the calls that you use, the algorithm doesn't even attempt to remember which cases are which, only whether they are still at risk at a given time.
You may incorporate lagged values of covariates if you want to include some form of memory. This document provides an example.
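A minimal sketch of a one-interval lag, assuming the long-format data frame df described above with one row per state per interval:

```r
library(dplyr)
library(survival)

# Lag statesh by one interval within each state, so the hazard in an
# interval depends on the previous interval's value (a simple form of memory).
df_lag <- df %>%
  group_by(state) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(statesh_lag1 = lag(statesh)) %>%
  ungroup()

# Each state's first interval has no previous value; those NA rows are
# dropped by coxph.
fit_lag <- coxph(Surv(start, stop, event) ~ statesh_lag1, data = df_lag)
```

The choice of lag length is substantive, not statistical: it encodes how long you believe a covariate value takes to affect the hazard.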
Also, be sure to distinguish this situation, with time-dependent covariates, from that with time-dependent coefficients. In the latter case the effects of the variables themselves do change over time. Time-dependent coefficients may be required if there are non-proportional hazards in a standard Cox regression. Section 4 of the vignette goes into some detail, and shows how time-transform (tt) functions can be used to simplify handling of either covariates or coefficients with known or assumed forms of time dependence.
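For illustration, here is a tt() term that lets the coefficient of statesh vary with log(time); the log form is only an assumption for the sketch, and df is the hypothetical long-format data frame from above:

```r
library(survival)

# Time-dependent coefficient: the effect of statesh is allowed to change
# with log(t). A significant tt(statesh) term suggests non-proportional
# hazards for statesh.
fit_tt <- coxph(Surv(start, stop, event) ~ statesh + tt(statesh),
                data = df,
                tt = function(x, t, ...) x * log(t))
summary(fit_tt)
```

The estimated log-hazard effect of statesh at time t is then the sum of the statesh coefficient and the tt(statesh) coefficient multiplied by log(t).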
Best Answer
Is there a specific reason why you are using penalized?
It strikes me that, with multiple rows per id, it makes sense to use cluster() in the survival package to denote which patient each observation belongs to.
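A sketch of what that might look like, assuming a long-format data frame df with an id column identifying the patient and a placeholder covariate x1:

```r
library(survival)

# cluster(id) gives a robust (sandwich) variance estimate that accounts
# for correlation among the multiple rows contributed by each patient.
fit_cl <- coxph(Surv(start, stop, event) ~ x1 + cluster(id),
                data = df)
summary(fit_cl)
```

The point estimates are unchanged relative to the unclustered fit; only the standard errors are adjusted for the within-patient correlation.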