Calculate a life table for different baseline groups (Cox survival analysis)

cox-modelrsurvival

I wish to recreate a table from a study (DOI: 10.1097/MAO.0000000000001914).
The study includes survival analyses on whether or not patients loose their hearing. Hearing is described using two outcomes: pure-tone average hearing threshold (PTA) in decibels and a standardized word-recognition score (WRS) in % recognized correctly. For hearing loss, cut-off values are used to create a binary variable of "serviceable hearing." Hearing at baseline, however, differs, creating a potential selection bias. To account for this bias, this table is created.
This table is described as using Cox predicted rates of loss of serviceable hearing, based on baseline hearing rates. It's looks like this:

My question is: how can I recreate this? Do you select a subgroup of patients that have this specific baseline (e.g. select all patients with a baseline PTA of <10 and a WRS between 90-100%, and consequently create a life table? ? I imagine there will be a lot of small groups (25!). Or is there any other technique, that is able to use more and potentially all your data for each estimation?

I work in R and use survfit and surv_summary to create the life tables.

Best Answer

In the cited paper,* the plots in Figure 3 and figure 4 indicate that the authors used baseline values of both WRS and PTA as continuous predictors. Such modeling of continuous predictors without binning into groups is generally the best preactice. Those plots are for single-predictor associations with outcome, with some type of flexible fit (perhaps restricted cubic splines; the authors didn't specify that detail in the methods).

Table 3 shows the associations of those predictors with outcome together in a two-predictor model, which they say was "developed using forward and backward selection." As they only show a single hazard ratio (HR) for each of PTA (2.07 per 10-dB change; 95% CI, 1.64–2.61) and WRS (1.48 per 10 percentage-point change; 95% CI, 1.27–1.72), it seems that non-linear terms were removed in the process and that interactions between WRS and PTA were either not considered or were found not to improve upon the model.

What's important with respect to your question is that there was no breakdown into subgroups. All data values for all participants went into constructing the model summarized in Table 3. Then that model was used to make predictions of outcomes for different hypothesized combinations of PTA and WRS. The predictions (with confidence intervals) are displayed in the table you show (Table 4 of the paper).

To reproduce the table you would need access to the original data. The HR values are relative to a baseline hazard, which is estimated from the data. The principles are quite straightforward and represent a standard use of Cox modeling for prediction, not just for description. In R, after using coxph() to build the model, the predict() function applied to the model along with sets of predictor values would give you these predicted probabilities of maintaining serviceable hearing.

*JB Hunter et al, "Hearing Outcomes in Conservatively Managed Vestibular Schwannoma Patients With Serviceable Hearing," Otology & Neurotology 39:e704–e711, 2018.

Related Solutions

Solved – Different “end of study” times for different cohorts – Cox PH model in survival analysis

Basically, we're assuming a Cox model with four groups. I'm not sure what you mean by stratifying for class as this wouldn't help you compare the four classes.

The question is whether or not we can use a standard Cox model when we're applying different censoring mechanisms to different groups of observations. And actually, we can.

An assumption of the model is independent right-censoring. What is actually meant by this, is independence of the censoring times and the event times conditional on the covariates of the model. In your example, we have to determine if it's reasonable to assume independent right-censoring. When we condition on the covariates (class being one of them), the censoring times are constants, thus independent of the event times. Therefore, the censoring times and event times are actually independent conditional on the covariates. When you know the value of the class covariate, the dependence between censoring time and event time vanishes, so to speak. So in your example, the assumption of independent right-censoring is still reasonable.

This is of course only true when the class covariate that influences the censoring is actually included in the model. Imagine that you didn't include class and furthermore assume that class and event time are not independent. Then (heuristically) information on the censoring time would give you information on the class, which in turn would give you information on the event time. Thus, event times and censoring times are not independent any longer and an assumption of the model just broke down.

A reference on this would be PK Andersen et al., Statistical Models based on Counting Processes, 1997, pp. 139-146.

A small simulation experiment might shed some light on the situation. The following R-code generates data from four groups, all with the same hazard.

data <- data.frame(grp = c(rep(1, 1000), rep(2, 1000), rep(3, 1000), rep(4, 1000)))

data$x <- rweibull(n = 4000, shape = 1, scale = 1)


data$event <- data$x < (data$grp == 1)*1 + (data$grp == 2)*.7 + 
  (data$grp == 3)*.5 + (data$grp == 4)*.3
data$x <- pmin(data$x, (data$grp == 1)*1 + (data$grp == 2)*.7 + 
                 (data$grp == 3)*.5 + (data$grp == 4)*.3)


coxph(Surv(time = x, event = event) ~ as.factor(grp), data = data)

If you run the code, you'll see that the estimates you get are good, we get coefficients close to one.

Without the data it's difficult to say why you're seeing this artefact of decreasing coefficients for increasing class. Off the top of my head:

What you're seeing might be due to a kind of selection bias. For each year some of the 700 people will not ever pass the exam (I'm guessing). If we imagine that a student fails the exam in a early year, he or she might be removed from the cohort all together. For instance, because the student quits the programme and therefore is no longer in the process of passing the exam. This would clearly constitute a selection bias as students that are passing the exam slowly (or never passing it) are removed from the early cohorts whereas in the later cohorts they have less time to quit the programme. That would give the estimates you report, even if there's no real differences between classes. See the following simulation experiment. It's identical to the above, but I'm just removing slow students, and more so in early years (this corresponds to drop-outs).

data <- data.frame(grp = c(rep(1, 1000), rep(2, 1000), rep(3, 1000), rep(4, 1000)))

data$x <- rweibull(n = 4000, shape = 1, scale = 1)

data <- subset(data, x < (data$grp == 1)*1 + (data$grp == 2)*1.3 + 
                 (data$grp == 3)*1.5 + (data$grp == 4)*1.7)

data$event <- data$x < (data$grp == 1)*1 + (data$grp == 2)*.7 + 
  (data$grp == 3)*.5 + (data$grp == 4)*.3
data$x <- pmin(data$x, (data$grp == 1)*1 + (data$grp == 2)*.7 + 
                 (data$grp == 3)*.5 + (data$grp == 4)*.3)


coxph(Surv(time = x, event = event) ~ as.factor(grp), data = data)

It goes without saying that this is just a guess of what could be happening with your data. Assuming that the negative coefficients are an artefact of some sort, it's not caused by the censoring mechanism, though.

Variables Needed for Cox Regression (Survival Analysis)

The answer depends on whether you are modeling covariates whose values change over time, and whether you are building a continuous-time or a discrete-time survival model.

If your model only uses covariate values in place at study entry (SurvTime = 0), then the comment by @user352188 is correct. The data format in that case just needs a single row per individual, which includes the covariate values at SurvTime = 0, the last follow-up time, and whether the last follow-up was an event time (as opposed to right-censoring).

If your model includes covariates whose values change over time, you need to have multiple data rows per individual. Each data row then includes for that individual a start time, a stop time, the covariate values in place during that time interval, and an indicator of whether or not the stop time is for an event. That's called the "counting process" data format.

The above assumes that you have a continuous-time model of survival. The vignettes of the R survival package go into more detail, with examples.

Sometimes there are only data at a small number of fixed times, say every year for 5 years (panel data). Such data are analyzed with discrete-time survival methods, a set of binomial regressions at each fixed time. In that case it's simplest to have a separate data row for each time point at which the individual is still at risk.

Best Answer

Related Solutions

Solved – Different “end of study” times for different cohorts – Cox PH model in survival analysis

Variables Needed for Cox Regression (Survival Analysis)

Related Question