Survival Analysis – Getting Baseline Survival Probability in Piecewise Cox Model

cox-modelinteractionp-valuesurvival

I use Cox regression model to develop a risk prediction model for a disease and want to get a risk equation from this. The codes for that below is something like below

data.split <- survSplit(Sur(days,event) ~ ., data = data, cut=c(1826), 
    episode="timeperiod"

model <- coxph(Surv(Sur(days,event) ~ treatment:strata(timeperiod) + age + 
                    sex + varX + age:sex, data=data.split

Because of the violation of the PH assumption, the effect of the treatment is split into <=5 years and >5 years. There is also an interaction term (actually the model contains much more predictors, but for simplicity, I just put a few variables here)

Now I use the following function to get the baseline survival probability at 5 years

test <- basehaz(model, centered = T)
test$surtest <- exp (-test$hazard)
test$surtest[test$time==1826]

However, I got a warning message after running the first line of the above code, saying that the models contains interaction terms and needs to consider adding a newdata argument.

Also, I use the following code to get the summation of the coefficients with the mean covariate values but I am not sure whether this is correct for a piecewise model with interaction term
sum(coef(model) * model$means)

I would appreciate for your help on how to get the above two values correctly.

Best Answer

First, there is no "risk equation" for a Cox model, although a non-parametric baseline hazard can be estimated. If you want an equation you need to specify a parametric model.

Second, asking for a baseline hazard with centered = TRUE is not useful when there are categorical predictors, and that's particularly true when there are interactions. The caution that you see after trying to call basehaz() is quite explicit and is seen even without time strata. Here's a simple example:

> library(survival)
> cox1 <- coxph(Surv(time,status)~age*sex,data=lung)
> bh1 <- basehaz(cox1, centered=T)
Warning message:
In survfit.coxph(fit, se.fit = FALSE) :
  the model contains interactions;
  the default curve based on columm means of the X matrix
  is almost certainly not useful.
  Consider adding a newdata argument.

> mean(lung$sex)
[1] 1.394737
> mean(lung$age)
[1] 62.44737

So a baseline hazard with centered = T, if it were to be reported, would be for an individual with sex = 1.394737 and age = 62.44737. That doesn't make a lot of sense, although you could request the hazard for that covariate combination from the survfit() function that is called by basehaz().

> bhCentered <- survfit(cox1,newdata=data.frame(sex= 1.394737,age= 62.44737))

You could plot the survfit object bhCentered or extract survival over time from bhCentered$surv and bhCentered$time.

Third, you did not set up your time-stratified model correctly. As Section 4.1, "Step functions," of the time dependence vignette indicates, after calling survSplit() you need to use the counting-process format, in your case Surv(tstart, days, event), as the outcome.

Fourth, with time strata you need to take extra steps for generating survival curves. Those steps are explained in detail in that section of the vignette, which says:

The default curve uses the mean covariate values, which is always problematic and completely useless in this case.

With your cut at days = 1826 there are no predictions for the first time stratum after that time, and no predictions for the second time stratum before that time. So you must provide survfit() with a new data set.

That data set must contain time intervals and the corresponding time strata, the status indicator (required but not used), and values of the covariates. You use the id argument to survfit() to indicate which time intervals/strata should be combined into single curves to represent the entire time course of interest. If you really want to do this yourself, I would recommend using realistic values of each of the covariates rather than the centered values. For example in your case:

> newData <- data.frame(tstart=c(0, 1826),days=c(1826,2*1826), event= c(0,0),timeperiod=c(1,2), treatment=c(1,1),age=c(60,60),sex=c("M","M"),varX=c(25,25),id=c(1,1))

> newData
  tstart days event timeperiod treatment age sex varX id
1      0 1826     0          1         1  60   M   25  1
2   1826 3652     0          2         1  60   M   25  1

would allow you to get a single reference survival curve out to 10 years for a 60-year-old Male under treatment 1 with a varX value of 25, by calling

sfit <- survfit(model, newdata=newData, id=id)

If you want to adjust for different sets of covariate values by hand, with covariates fixed in time you can just sum up the products of the coefficients times the corresponding differences in covariate values from those of that reference case.

Fifth, you risk a lot of trouble and errors trying to do these types of calculations yourself. Take advantage of the tools provided by the software, like the survfit() function, which also make it straightforward to display confidence intervals around the point estimates.

Finally, as Frank Harrell noted in a comment, there probably isn't a sharp jump in hazard at 5 years. You might be better off using a continuous time-varying coefficient, as explained later in the vignette.

Related Question