I'm trying to understand how `cox.zph`

function in R programming language works and I find myself not knowing what `km`

transform mean. I get the `rank`

transform and obviously the `identity`

one, but `km`

is still not clear to me and codes for this function did not make it clearer.

# Solved – What does ‘km’ transform in cox.zph function mean

cox-modeldata transformationhazardproportional-hazardsr

#### Related Solutions

**Since the seasonal dummy variables are static by nature, and their coefficients clearly vary with the time variable, how much does it matter?**

The value that you get is a form of average over time. Unfortunately as you naturally have more cases in time-to-event analyses early on, you can't simply say that the effect is balanced throughout time. From my experience the initial period has a much heavier impact on the estimate than the later and it

**I get that statistically, it means something, but does the violation of PH assumption invalidate the (intuitively appealing) result that non-response is more likely to happen in the summer and winter?**

Again, it probably doesn't but you can't be sure until you've checked. Something is definitely happening over time and you should at least have a look at the residual plot (`plot(cox.zph(...))`

). It isn't entirely surprising that you have a problem with the PH since seasons are part of the time variable and there will be situations where early summer and late spring are similar.

**If so, is there a way to handle this so that the PH assumption is not violated? I know about using the tt transform, but I can't seem to figure out the exact form for the function.**

The `tt`

transform is tricky to use with big data. It explodes the matrix and can get a little messy, e.g. if you modify the lung example in the survival package:

```
library(survival)
coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung,
tt=function(x,t,...) {
print(length(x))
pspline(x + t/365.25)
})
```

It prints **15809** while there are only **228** rows in the original dataset. The principle of the `tt()`

is that it feeds the variables into the transformation function where you are free to use time any way you wish. Note that you can also have different transformation functions depending for each variable:

```
library(survival)
coxph(Surv(time, status) ~ tt(ph.ecog) + tt(age), data=lung,
tt=list(
function(x,t,...) {
cbind(x, x + t/365.25, (x + t/365.25)^2)
},
function(x,t,...) {
pspline(x + t/365.25)
}),
x=T,
y=T) -> fit
head(fit$x)
```

Gives:

```
tt(ph.ecog)x tt(ph.ecog) tt(ph.ecog) tt(age)1 tt(age)2 tt(age)3 tt(age)4
6 1 3.4 11.7 0 0 0 0.000
3 0 2.4 5.8 0 0 0 0.020
38 1 3.4 11.7 0 0 0 0.000
5 0 2.4 5.8 0 0 0 0.000
6.1 1 3.2 10.4 0 0 0 0.000
3.1 0 2.2 5.0 0 0 0 0.026
tt(age)5 tt(age)6 tt(age)7 tt(age)8 tt(age)9 tt(age)10 tt(age)11 tt(age)12
6 0.00 0.00000 0.000 0.0052 0.359 0.58 0.053 0
3 0.48 0.48232 0.021 0.0000 0.000 0.00 0.000 0
38 0.00 0.00087 0.266 0.6393 0.094 0.00 0.000 0
5 0.03 0.51933 0.437 0.0136 0.000 0.00 0.000 0
6.1 0.00 0.00000 0.000 0.0078 0.388 0.56 0.044 0
3.1 0.50 0.45457 0.016 0.0000 0.000 0.00 0.000 0
```

I therefore try to avoid this solution and use the time-split approach that I wrote about here and in my answer to my own question.

What's plotted starts with a variance-weighted transformation of the Schoenfeld residuals for a covariate, into what are called "scaled Schoenfeld residuals." Those are then added to the corresponding time-invariant coefficient estimate from the Cox model under the proportional hazards (PH) assumption and smoothed. The result is a plot of an estimate of the regression coefficient for the covariate over time. If the plot is reasonably flat, the PH assumption holds. Take this one step at a time.

*Risk-weighted covariate averages and covariances*

You start by determining, for each event time, the risk-weighted averages of covariate values and the corresponding risk-weighted covariance among covariate values over all individuals at risk at that time. That's essentially a part of the model-fitting process, anyway. The risks used for the weighting are simply the corresponding hazard ratios for the individuals at that time, the exponentiated linear-predictor values from the model.

*Schoenfeld residuals*

The Schoenfeld residuals are calculated for all covariates for each individual experiencing an event at a given time. Those are the *differences* between that individual's covariate values at the event time and the corresponding risk-weighted average of covariate values among all those then at risk. The word "residual" thus makes sense, as it's the difference between an observed covariate value and what you might have expected based on all those at risk at that time.

*Scaled Schoenfeld residuals*

The Schoenfeld residuals are then scaled inversely with respect to their (co)variances. The scaled values at an event time for an individual come from pre-multiplying the vector of original Schoenfeld residuals by the inverse of the corresponding risk-weighted covariate covariance matrix at that time. You can think of this as down-weighting Schoenfeld residuals whose values are uncertain because of high variance.

*The "plot of scaled Schoenfeld residuals"*

Although the plot you show is generally called a "plot of scaled Schoenfeld residuals," that's not quite right.

The importance of the scaled Schoenfeld residuals comes from their associations with the *time-dependence* of a Cox regression coefficient. If $s_{k,j}^*$ is a scaled Schoenfeld residual for covariate $j$ at time $t_k$ and the estimated time-fixed Cox regression coefficient under PH is $\hat \beta_j$, then the expected value of $s_{k,j}^*$ is approximately the deviation of the actual coefficient value at time $t_k$, $\beta_j(t_k)$, from the PH-based estimate:

$$E(s_{k,j}^*) + \hat \beta_j \approx \beta_j(t_k).$$

That was shown by Grambsch and Therneau in 1994. The y-axis values of the plot for covariate $j$ are the *sums* of the scaled Schoenfeld residuals with the corresponding PH estimate $\hat \beta_j$.

*Simple answer to the question*

The smoothed plot is thus **an estimate of the time dependence of the coefficient for the covariate $j$,** $\beta_j(t_k)$. In your case, the plot indicates that your biomarker is most strongly associated with outcome at early times, dropping off to almost no association beyond a time value of 50-60.

The above is pretty much based on Therneau and Grambsch. Section 6.2 presents the plotting of scaled Schoenfeld residuals, with ordinary Schoenfeld residuals described in Section 4.6 and the formulas for risk-weighted covariate means and covariances in Section 3.1 (equations 3.5 and 3.7, respectively).

## Best Answer

km stands for Kaplan-Meier estimator.

$$\hat{S}(t) = \prod_{i: t_i \le t}\left(1-\frac{d_i}{n_i} \right)$$

with $t_{i}$ a time when at least one event happened, $d_i$ the number of events (i.e., deaths) that happened at time $t_{i}$ and ${\displaystyle n_{i}}$ the individuals known to have survived (have not yet had an event or been censored) up to time $t_{i}$.

Here is quote from the paper Cox Proportional-Hazards Regression for Survival Data:

To know why the choice of

`km`

as the default, Dr. Kevin E. Thorpe cited Dr. Therneau's reply in the R-news: