Repeated Measures in R’s coxph() – Detailed Handling

Tags: cox-model, frailty, r, repeated measures, survival

Context

I'm attempting to understand how R's coxph() accepts and handles repeated entries for subjects (or patient/customer if you prefer). Some call this Long format, others call it 'repeated measures'.

See for example the data set that includes the ID column in the Answers section at:

Best packages for Cox models with time varying covariates

Also assume covariates are time-varying throughout and there is exactly one censor (i.e. event) variable, which is binary.

Questions

1) In the above link's answer, if ID is not given as a parameter in the call to coxph() should the results be the same as including cluster(ID) as a parameter in coxph()?

I attempted to search for documentation, but the following doesn't seem to clearly address (1):
https://stat.ethz.ch/pipermail/r-help//2013-July/357466.html

2) If the answer to (1) is 'no', then (mathematically) why? It seems cluster() in coxph() seeks correlations between subjects as per subsection 'cluster' on pg. 20 at

https://cran.r-project.org/web/packages/survival/survival.pdf

3) Vague question: how does coxph() with repeated measures compare to R's frailtypack regression methods?

Addenda

The following hints at using cluster(ID):

Is there a repeated measures aware version of the logrank test?

as does:

https://stat.ethz.ch/pipermail/r-help//2013-July/357466.html

GEE approach: add "+ cluster(subject)" to the model statement in coxph
Mixed models approach: Add " + (1|subject)" to the model statement in coxme.

Thanks in advance!

Best Answer

  1. Including cluster(ID) does not change the point estimates of the parameters. It does, however, change the way the standard errors are computed.

    More details can be found in Therneau & Grambsch's book Extending the Cox Model, chapter 8.2. Note that their example uses method = "breslow" to handle ties, but the default (method = "efron") leads to a similar calculation of the standard errors, which appear in the summary as "robust se".

  2. If cluster(ID) is used, a "robust" (sandwich) estimate of the standard errors is computed, which allows for possible dependence among the observations within the same cluster (here, the repeated rows of one subject). Not using cluster(ID), on the other hand, treats every observation as independent, so more "information" is assumed to be in the data. In more technical terms, the score function for the parameters does not change, but the estimated variance of this score does. A more intuitive argument is that 100 observations on 100 individuals provide more information than 100 observations on 10 individuals (or clusters).

  3. Vague indeed. In short, +frailty(ID) in coxph() fits standard frailty models with gamma or log-normal random effects and with non-parametric baseline hazard / intensity. frailtypack uses parametric baseline (also flexible versions with splines or piecewise constant functions) and also fits more complicated models, such as correlated frailty, nested frailty, etc.
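To make the coxph() side of point 3 concrete, here is a minimal sketch of the two frailty distributions mentioned above. It assumes the cgd data set that ships with the survival package (recurrent infections in counting-process form, several rows per id), which is not the data from the question; the covariates treat and age are picked purely for illustration.

```r
library(survival)

# cgd ships with survival: one row per (tstart, tstop] interval, several rows per id.
# Illustrative choice of data and covariates, not the data from the question.

# Gamma frailty shared by all rows of the same subject
fit_gamma <- coxph(Surv(tstart, tstop, status) ~ treat + age +
                     frailty(id, distribution = "gamma"),
                   data = cgd)

# Gaussian random effect on the log-hazard scale, i.e. a log-normal frailty
fit_lognorm <- coxph(Surv(tstart, tstop, status) ~ treat + age +
                       frailty(id, distribution = "gaussian"),
                     data = cgd)

# The printed output includes the estimated variance of the random effect
print(fit_gamma)
print(fit_lognorm)
```

Both fits keep the non-parametric baseline hazard; only the distribution of the subject-level random effect differs.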

Finally, +cluster() is somewhat in the spirit of GEE, in that you take the score equations from a likelihood with independent observations, and use a different "robust" estimator for the standard errors.
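Points 1 and 2 can be checked directly. The sketch below again assumes the cgd data set from the survival package as a stand-in for "several rows per ID" (it is recurrent-event data, so it is an illustration rather than the exact setting of the question), and fits the same model with and without cluster(id).

```r
library(survival)

# Same model, without and with cluster(id); cgd is an illustrative data set
fit_naive  <- coxph(Surv(tstart, tstop, status) ~ treat + age, data = cgd)
fit_robust <- coxph(Surv(tstart, tstop, status) ~ treat + age + cluster(id),
                    data = cgd)

# Point 1: the point estimates agree
all.equal(coef(fit_naive), coef(fit_robust))

# Point 2: only the variance estimate changes.
# With cluster(), $var holds the robust (sandwich) variance and
# $naive.var keeps the model-based variance of the unclustered fit.
se_naive  <- sqrt(diag(fit_naive$var))
se_robust <- sqrt(diag(fit_robust$var))
cbind(coef = coef(fit_robust), se_naive, se_robust)

# summary() reports both "se(coef)" and "robust se" columns
summary(fit_robust)
```

How far the two sets of standard errors differ depends on how strongly the rows within a subject are correlated, so the size of the gap in this example should not be read as typical.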

edit: Thanks @Ivan for the suggestions regarding the clarity of the post.