Solved – What does ‘km’ transform in cox.zph function mean

cox-modeldata transformationhazardproportional-hazardsr

I'm trying to understand how cox.zph function in R programming language works and I find myself not knowing what km transform mean. I get the rank transform and obviously the identity one, but km is still not clear to me and codes for this function did not make it clearer.

Best Answer

km stands for Kaplan-Meier estimator.

$$\hat{S}(t) = \prod_{i: t_i \le t}\left(1-\frac{d_i}{n_i} \right)$$

with $t_{i}$ a time when at least one event happened, $d_i$ the number of events (i.e., deaths) that happened at time $t_{i}$ and ${\displaystyle n_{i}}$ the individuals known to have survived (have not yet had an event or been censored) up to time $t_{i}$.

Here is quote from the paper Cox Proportional-Hazards Regression for Survival Data:

Tests and graphical diagnostics for proportional hazards may be based on the scaled Schoenfeld residuals; these can be obtained directly as residuals(model, "scaledsch"), where model is a coxph model object. The matrix returned by residuals has one column for each covariate in the model. More conveniently, the cox.zph function calculates tests of the proportional-hazards assumption for each covariate, by correlating the corresponding set of scaled Schoenfeld residuals with a suitable transformation of time [the default is based on the Kaplan-Meier estimate of the survival function, $K(t)$].

To know why the choice of km as the default, Dr. Kevin E. Thorpe cited Dr. Therneau's reply in the R-news:

There are 2 reasons for making the KM the default:

Safety: The test for PH is essentially a least-squares fit of line to a plot of f(time) vs residual. If the plot contains an extreme oulier in x, then the test is basically worthless. This sometimes happens with transform= identity or transform =log. It doesn't with transform='KM'.

As a default value for naive users, I chose the safe course.

A secondary reason is efficiency. In DY Lin, JASA 1991 Dan-Yu argues that this is a "good" test statistic under various assumptions about censoring. (His measure has the same score statistics as the KM option).

But #1 is the big one.

Terry T.

Related Solutions

Cox Proportional Hazard Model – Insights on Time Varying Coefficients in Cox Proportional Hazard Model

Since the seasonal dummy variables are static by nature, and their coefficients clearly vary with the time variable, how much does it matter?

The value that you get is a form of average over time. Unfortunately as you naturally have more cases in time-to-event analyses early on, you can't simply say that the effect is balanced throughout time. From my experience the initial period has a much heavier impact on the estimate than the later and it

I get that statistically, it means something, but does the violation of PH assumption invalidate the (intuitively appealing) result that non-response is more likely to happen in the summer and winter?

Again, it probably doesn't but you can't be sure until you've checked. Something is definitely happening over time and you should at least have a look at the residual plot (plot(cox.zph(...))). It isn't entirely surprising that you have a problem with the PH since seasons are part of the time variable and there will be situations where early summer and late spring are similar.

If so, is there a way to handle this so that the PH assumption is not violated? I know about using the tt transform, but I can't seem to figure out the exact form for the function.

The tt transform is tricky to use with big data. It explodes the matrix and can get a little messy, e.g. if you modify the lung example in the survival package:

library(survival)
coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung,
      tt=function(x,t,...) {
        print(length(x))
        pspline(x + t/365.25)
      })

It prints 15809 while there are only 228 rows in the original dataset. The principle of the tt() is that it feeds the variables into the transformation function where you are free to use time any way you wish. Note that you can also have different transformation functions depending for each variable:

library(survival)
coxph(Surv(time, status) ~ tt(ph.ecog) + tt(age), data=lung,
      tt=list(
        function(x,t,...) {
          cbind(x, x + t/365.25, (x + t/365.25)^2)
        },
        function(x,t,...) {
          pspline(x + t/365.25)
        }),
      x=T,
      y=T) -> fit
head(fit$x)

Gives:

    tt(ph.ecog)x tt(ph.ecog) tt(ph.ecog) tt(age)1 tt(age)2 tt(age)3 tt(age)4
6              1         3.4        11.7        0        0        0    0.000
3              0         2.4         5.8        0        0        0    0.020
38             1         3.4        11.7        0        0        0    0.000
5              0         2.4         5.8        0        0        0    0.000
6.1            1         3.2        10.4        0        0        0    0.000
3.1            0         2.2         5.0        0        0        0    0.026
    tt(age)5 tt(age)6 tt(age)7 tt(age)8 tt(age)9 tt(age)10 tt(age)11 tt(age)12
6       0.00  0.00000    0.000   0.0052    0.359      0.58     0.053         0
3       0.48  0.48232    0.021   0.0000    0.000      0.00     0.000         0
38      0.00  0.00087    0.266   0.6393    0.094      0.00     0.000         0
5       0.03  0.51933    0.437   0.0136    0.000      0.00     0.000         0
6.1     0.00  0.00000    0.000   0.0078    0.388      0.56     0.044         0
3.1     0.50  0.45457    0.016   0.0000    0.000      0.00     0.000         0

I therefore try to avoid this solution and use the time-split approach that I wrote about here and in my answer to my own question.

Solved – Schoenfeld residuals – Plain English explanation, please!

What's plotted starts with a variance-weighted transformation of the Schoenfeld residuals for a covariate, into what are called "scaled Schoenfeld residuals." Those are then added to the corresponding time-invariant coefficient estimate from the Cox model under the proportional hazards (PH) assumption and smoothed. The result is a plot of an estimate of the regression coefficient for the covariate over time. If the plot is reasonably flat, the PH assumption holds. Take this one step at a time.

Risk-weighted covariate averages and covariances

You start by determining, for each event time, the risk-weighted averages of covariate values and the corresponding risk-weighted covariance among covariate values over all individuals at risk at that time. That's essentially a part of the model-fitting process, anyway. The risks used for the weighting are simply the corresponding hazard ratios for the individuals at that time, the exponentiated linear-predictor values from the model.

Schoenfeld residuals

The Schoenfeld residuals are calculated for all covariates for each individual experiencing an event at a given time. Those are the differences between that individual's covariate values at the event time and the corresponding risk-weighted average of covariate values among all those then at risk. The word "residual" thus makes sense, as it's the difference between an observed covariate value and what you might have expected based on all those at risk at that time.

Scaled Schoenfeld residuals

The Schoenfeld residuals are then scaled inversely with respect to their (co)variances. The scaled values at an event time for an individual come from pre-multiplying the vector of original Schoenfeld residuals by the inverse of the corresponding risk-weighted covariate covariance matrix at that time. You can think of this as down-weighting Schoenfeld residuals whose values are uncertain because of high variance.

The "plot of scaled Schoenfeld residuals"

Although the plot you show is generally called a "plot of scaled Schoenfeld residuals," that's not quite right.

The importance of the scaled Schoenfeld residuals comes from their associations with the time-dependence of a Cox regression coefficient. If $s_{k,j}^*$ is a scaled Schoenfeld residual for covariate $j$ at time $t_k$ and the estimated time-fixed Cox regression coefficient under PH is $\hat \beta_j$, then the expected value of $s_{k,j}^*$ is approximately the deviation of the actual coefficient value at time $t_k$, $\beta_j(t_k)$, from the PH-based estimate:

$$E(s_{k,j}^*) + \hat \beta_j \approx \beta_j(t_k).$$

That was shown by Grambsch and Therneau in 1994. The y-axis values of the plot for covariate $j$ are the sums of the scaled Schoenfeld residuals with the corresponding PH estimate $\hat \beta_j$.

Simple answer to the question

The smoothed plot is thus an estimate of the time dependence of the coefficient for the covariate $j$, $\beta_j(t_k)$. In your case, the plot indicates that your biomarker is most strongly associated with outcome at early times, dropping off to almost no association beyond a time value of 50-60.

The above is pretty much based on Therneau and Grambsch. Section 6.2 presents the plotting of scaled Schoenfeld residuals, with ordinary Schoenfeld residuals described in Section 4.6 and the formulas for risk-weighted covariate means and covariances in Section 3.1 (equations 3.5 and 3.7, respectively).

Best Answer

Related Solutions

Cox Proportional Hazard Model – Insights on Time Varying Coefficients in Cox Proportional Hazard Model

Solved – Schoenfeld residuals – Plain English explanation, please!

Related Question