R – Risk Table for a Kaplan-Meier Plot

ggplot2, kaplan-meier, r, risk, survival

I need to make a Kaplan-Meier plot with an at-risk (risk-set) table beneath it; that is, a table of the number of subjects at risk at different time points, aligned below the figure. I found a website that explains how to do this for a plot that contains multiple subgroups, using the ggkm function, the code for which is available here.

# Example of a plot like this
library(survival)
data(colon)
fit <- survfit(Surv(time,status)~rx, data=colon)
ggkm(fit, timeby=500, ystratalabs=c("Obs","Lev","Lev+5FU"))

[Figure: example of a plot like this]

Looking at the data that was used to make the above plot, we see:

Call: survfit(formula = Surv(time, status) ~ rx, data = colon)

           records n.max n.start events median 0.95LCL 0.95UCL
rx=Obs         630   630     630    345   1723    1323    2213
rx=Lev         620   620     620    333   1709    1219    2593
rx=Lev+5FU     608   608     608    242     NA      NA      NA

My question:
My plot only has one group. The call looks like this:

Call: survfit(formula = Surv(Recur_day/365.242, Recur) ~ 1, data = study_data)

records   n.max n.start  events  median 0.95LCL 0.95UCL 
    440     440     440      92      NA      NA      NA

[Figure: my plot]

When I try to use the ggkm function I get an error like so:

 ggkm(survfit(formula = Surv(Recur_day/365.242, Recur) ~ 1, data = study_data), timeby = 2)
Error in data.frame(time = sfit$time[subs2], n.risk = sfit$n.risk[subs2],  : 
  arguments imply differing number of rows: 1, 0
In addition: Warning message:
In max(nchar(ystratalabs)) :
  no non-missing arguments to max; returning -Inf

Does anyone have an idea of what I'm doing wrong? Or is there an easier way for me to do this by hand? I am open to just putting the numbers in a text box in PowerPoint myself; I'm just not certain how to get the number of subjects at risk at each time point (I'd like two years, four years, six years, etc.).

Best Answer

You would be well advised to check that code carefully. If you look at the number of cases with complete data, the starting numbers in the risk sets are noticeably different:

sum(na.omit(colon)$rx=="Obs")
[1] 610   # was 630 in the figure above, but rows could be missing other covariates

This will generate a tabular calculation and then serially subtract the number of subjects whose follow-up ended (through an event or censoring) in the prior interval:

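# Count subjects whose follow-up ends (event or censoring) in each 500-day interval, by arm.
# Note: cut() gives NA for times beyond the last break, and table() drops those rows.
risksets <- with(na.omit(colon[ , c("time", "status", "rx")]),
                 table(rx, cut(time, seq(0, max(time), by = 500))))
# For each arm, start from the row total and subtract the interval counts cumulatively.
sapply(1:3, function(i)
  Reduce("-", risksets[i, ], init = rowSums(risksets)[i], accumulate = TRUE))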
#-----------------
    [,1] [,2] [,3]
Obs  621  612  594
Obs  461  456  484
Obs  363  352  411
Obs  306  310  373
Obs  247  258  314
Obs   81   99  113
Obs    0    0    0

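As is typical with R's *apply functions, the result is transposed because each result is returned as a matrix column: here each column is one treatment arm and each row is a cut point (0, 500, 1000, ...). Use t() to fix it. The repeated "Obs" row labels appear to be carried over from the first arm's name and do not identify the group.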
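For the single-group case in the question, a shorter route may be to ask the fitted object itself for the risk set at chosen times. The following is a minimal sketch, assuming the survival package is installed; the colon data stand in for the asker's study_data (which is not available here), and the names fit1, at, and risk_tab are made up for illustration.

```r
library(survival)

# One-group fit, analogous to Surv(Recur_day/365.242, Recur) ~ 1 in the question.
# colon is used only as a stand-in data set; time is converted from days to years.
fit1 <- survfit(Surv(time / 365.242, status) ~ 1, data = colon)

# Time points (in years) at which the number at risk is wanted.
at <- seq(0, 8, by = 2)

# summary() on a survfit object accepts a `times` argument and reports
# the number at risk at each requested time.
s <- summary(fit1, times = at)
risk_tab <- data.frame(years = s$time, n.risk = s$n.risk)
print(risk_tab)
```

The n.risk column can then be typed beneath the plot by hand. Requested times beyond the last observed time are dropped by default, so the table may be shorter than the vector passed in. For a fit with groups (for example ~ rx), the summary also carries a strata component that identifies the group for each row.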

Related Solutions

Solved – How to plot adjusted Kaplan-Meier Curves

The only way to provide differential survival with true KM curves is to generate new curves for different groups. You could then display a curve for all persons of group 3, for example. The number of units in each group will decrease as the number of strata increases. However, this method is empiric and does not truly adjust the sample to some chosen set of values.

I am most familiar with methods for obtaining adjusted curves derived from Cox or parametric survival models. Generally speaking, the role of an adjusted curve is to graphically display the expected mortality (or mortality transformed to survival) of the sample if a single or combination of values is set to some fixed value or set of values, respectively.

For example, one might find from a Cox model the hazard ratio for blood pressure is 1.1. Thus, for each 1 unit increase in blood pressure, the average hazard at a given time point multiplies by 1.1. Now, if we wanted to display the mortality curve for all units under analysis (sample) adjusted to a blood pressure of 1 standard deviation above the mean, we could display an adjusted curve.
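The arithmetic behind that example can be written out under the usual Cox proportional hazards form. This is a sketch in standard notation; the symbols $h_0$, $S_0$, $\beta$, and $k$ are introduced here and are not from the original answer.

```latex
% Cox model: the covariate x (blood pressure) scales a baseline hazard
h(t \mid x) = h_0(t)\, e^{\beta x}, \qquad e^{\beta} = 1.1

% A k-unit increase multiplies the hazard by 1.1^k at every time point
\frac{h(t \mid x + k)}{h(t \mid x)} = e^{\beta k} = 1.1^{k},
\qquad \text{e.g. } k = 5:\; 1.1^{5} \approx 1.61

% Converting hazard to survival: the adjusted curve at a chosen value x
S(t \mid x) = S_0(t)^{\,e^{\beta x}}
```

The last line is the mortality-to-survival transformation: fixing x at a chosen value (for example one standard deviation above the mean) gives the adjusted curve for that value.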

Here is a self-contained example using group for your reference. Note that the final, adjusted curve is for the mean of group, which for most applications would have no real meaning. Also note that a transformation from mortality to survival is required for this method.

library("survival")

set.seed(1)  # make the simulated data reproducible
days <- rpois(100, 3)
status <- rbinom(100, 1, 0.34)
group <- sample(c(1, 2, 3, 4), 100, replace = TRUE)
df <- data.frame(days, status, group)  # include group so every model can use data = df

# overall survival
surv <- survfit(Surv(days, status) ~ 1, data = df)
summary(surv)
plot(surv)

# survival by group (one empirical K-M curve per group)
kmsurv <- survfit(Surv(days, status) ~ strata(group), data = df)
plot(kmsurv)

# survival adjusted to group effect (curve evaluated at the mean of group)
cox <- coxph(Surv(days, status) ~ group, data = df)
summary(cox)
plot(survfit(cox))

Essential information on the R code can be found here.
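To draw an adjusted curve at a chosen covariate value rather than at the mean, survfit() on a Cox fit accepts a newdata argument. The following is a minimal sketch on simulated data; the covariate bp, its distribution, and the names sim, cox_bp, new_bp, and adj are all assumptions made for illustration and are not part of the original example.

```r
library(survival)

set.seed(42)  # arbitrary seed, for reproducibility only
n <- 200
bp <- rnorm(n, mean = 120, sd = 15)                    # hypothetical blood-pressure values
days <- rexp(n, rate = 0.01 * exp(0.02 * (bp - 120)))  # higher bp gives shorter times
status <- rbinom(n, 1, 0.7)                            # 1 = event, 0 = censored
sim <- data.frame(days, status, bp)

cox_bp <- coxph(Surv(days, status) ~ bp, data = sim)

# Covariate values at which to evaluate the adjusted curves:
# the sample mean, and one standard deviation above it.
new_bp <- data.frame(bp = c(mean(sim$bp), mean(sim$bp) + sd(sim$bp)))
adj <- survfit(cox_bp, newdata = new_bp)

# One curve per row of new_bp.
plot(adj, lty = 1:2, xlab = "Days", ylab = "Adjusted survival")
legend("topright", legend = c("mean bp", "mean bp + 1 SD"), lty = 1:2)
```

Both curves come from the same fitted model and the same subjects, so neither is an empirical curve for a subgroup.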

Lastly, it is my opinion that adjusted survival analysis is generally statistical sleight of hand as the adjustment process can 1) be used for unrealistic patterns of covariates, 2) fool the reader into believing that non-significant effects can result in some displayed survival/mortality pattern and 3) be confused with empiric curves, leading readers to believe you have more events or information for each subgroup/pattern than you actually possess. I would carefully consider why adjusted curves are desirable over adjusted hazard ratios before spending too much time.

Survival Curves – How to Use Surv and Survfit

The "mathematical reason" is fundamentally the reason why the curves differ. Welcome to the world of left-truncated survival times.

When a case has a start time greater than 0 in the way you formatted the data for sfit2, that case provides no information about survival prior to that start time. That's considered left truncation.

As you say, those left-truncated cases don't enter the risk set prior to that time. Each drop in the Kaplan-Meier (K-M) curve is determined by the ratio of the number of events at that time to the number of cases at risk. When you diminish the number of cases at risk at early times while keeping the same number of early events, the K-M curve necessarily drops faster at the start. With the product limit form of the K-M estimator, once the curve has dropped you have a lower baseline for the next drop.

Furthermore, the examples you show of left truncation seem not to enter the risk set until the original K-M curve is relatively flat beyond a time of 30, with relatively few later events. So they provide very little information at all, as they are in relatively few risk sets at event times and apparently only after most of the events have occurred, and thus have little influence on the subsequent shape of the curve.

The event = 3 specified in one of your examples evidently represents interval censoring, but you can also have left truncation with a defined event time if you specify event = 1 for the end time. That's the data format used for time-dependent covariates and for other applications of the counting-process data format in survival analysis, like repeated events.
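The effect of left truncation on the early part of the curve can be seen in a small example. The following is a sketch on made-up data (the toy data frame and all object names are hypothetical): the same ten subjects are fitted once with everyone at risk from time 0, and once with four of them entering the risk set only after time 3.

```r
library(survival)

# Toy data, invented for illustration: exit times 1..10, with 1 = event, 0 = censored.
# The last four subjects have a delayed entry (left truncation) at time 3.
toy <- data.frame(
  start = c(0, 0, 0, 0, 0, 0, 3, 3, 3, 3),
  stop  = 1:10,
  event = c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)
)

# Ordinary K-M: everyone is at risk from time 0.
fit_full <- survfit(Surv(stop, event) ~ 1, data = toy)

# Counting-process form: each subject is at risk only on (start, stop].
fit_trunc <- survfit(Surv(start, stop, event) ~ 1, data = toy)

times <- c(1, 2, 4)
comparison <- data.frame(
  time       = times,
  surv_full  = summary(fit_full,  times = times)$surv,
  surv_trunc = summary(fit_trunc, times = times)$surv
)
print(comparison)
```

At time 1 the full fit has 10 subjects at risk and the truncated fit only 6, so the same single event produces a drop of 1/10 in one curve and 1/6 in the other. By time 2 the estimates are 0.8 and about 0.667, and the truncated curve stays lower afterwards because each later drop multiplies a lower baseline.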
