How to Estimate Survival Probabilities in Multivariate Cox Proportional Hazards Model

cox-modelrsurvival

I'm working through the examples for Cox proportional hazard models provided in http://www.sthda.com/english/wiki/cox-proportional-hazards-model, and I am on the last example labeled "Visualizing the estimated distribution of survival times". When I run the code posted at the bottom (garnered from the link), I get the output shown immediately below which matches the plot provided in the link.

All the link says in explanation is "Consider that, we want to assess the impact of the sex on the estimated survival probability. In this case, we construct a new data frame with two rows, one for each value of sex; the other covariates are fixed to their average values (if they are continuous variables) or to their lowest level (if they are discrete variables). For a dummy covariate, the average value is the proportion coded 1 in the data set. This data frame is passed to survfit() via the newdata argument:". Is there a better, simple, "plain English" way to describe what is being presented in the plot? On the surface, I get "assess the impact of the sex on the estimated survival probability" but that seems incomplete: why all the computational gyrations through multivariate analysis, fixing other covariates to their average values, and identifying/coding dummy (confounding) variables?

enter image description here

Code:

library(ggplot2)
library(survival)
library(survminer)

head(lung) # << check to see that lung data is loaded into workspace

# Cox regression of time to death on the time-constant covariates:
res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data =  lung)
summary(res.cox)

# assess the impact of the sex on the estimated survival probability; create the new data:  
sex_df <- with(lung,
               data.frame(sex = c(1, 2), 
                          age = rep(mean(age, na.rm = TRUE), 2),
                          ph.ecog = c(1, 1)
               )
)

# Survival curves 
# the below didn't work until I added data = lung, in a departure from the text
fit <- survfit(res.cox, data = lung, newdata = sex_df)
ggsurvplot(fit, 
           conf.int = TRUE, 
           legend.labs=c("Sex=1", "Sex=2"),
           ggtheme = theme_minimal()
           )

Best Answer

The method you describe is known as the average covariate method. In plain english, you have fit a Cox model which allows you to predict the survival probability of an individual at time t given their observed covariate values (and an estimate of the baseline hazard, but lets ignore this for now).

Now if you want to show the effect of sex on the survival probability, it would be nice to have one survival curve for each sex. However, since you also included age and ph.ecog in the Cox model, you cannot estimate a single survival curve for each sex directly. The Cox model only allows you to obtain an age and sex and ph.ecog specific survival probability prediction. For example, you could plot the conditional survival curve of an individual aged 50, male, and ph.ecog = 1. To get only two marginalized curves, the average covariate method uses the average of the other covariates. The rationale behind this is that you plot the survival curve of the "average" individual with only the sex being different.

This method has been shown to be highly biased and should generally not be used. See for example the excellent article by Nieto & Coresh https://pubmed.ncbi.nlm.nih.gov/8629613/. What you are usually actually looking for are confounder adjusted survival curves, which can be estimated using various other methods (see https://onlinelibrary.wiley.com/doi/full/10.1002/sim.9681 for a review and simulation study). The adjustedCurves package (https://cran.r-project.org/package=adjustedCurves) directly implements those methods in R.

Related Question