Solved – R function survfit(): give newdata reference values for factors with >2 levels

contrastscox-modelkaplan-meierrsurvival

The goal is to plot separate survival curves for each level of a categorical variable in R, using a coxph model including the variable and a second categorical variable with >2 levels.

If the second categorical variable had had only 2 levels, it is possible to give the proportion of one level as the reference value. This is frowned upon by the R community, but done by default in some other statistical software.

An example using the mgus dataset:

library(survival)
mgustest <- mgus
mgustest$agecat <- cut(mgustest$age, breaks=quantile(mgustest$age))
    mgustest$albcat <- cut(mgustest$alb, breaks=quantile(mgustest$alb, na.rm=T))
mfit.refs <- data.frame("sex"=mean(mgustest$sex == "female"),
                        "albcat"=c('(1.8,3]','(3,3.2]','(3.2,3.5]','(3.5,5.1]'))
mfit <- coxph(Surv(futime, death) ~ sex + albcat, data=mgustest)
plot(survfit(mfit, newdata=mfit.refs), col=c("black","red","blue","green"))

The newdata data frame looks like this:

      sex    albcat
1 0.4190871   (1.8,3]
2 0.4190871   (3,3.2]
3 0.4190871 (3.2,3.5]
4 0.4190871 (3.5,5.1]

Assuming the model:

mfit2 <- coxph(Surv(futime, death) ~ agecat + albcat, data=mgustest)

How should I build the newdata data frame to be able to fit separate survival curves for albcat levels with survfit (I only want four curves representing the albact levels)?

Thoughts: How do other statistical software produce these survival curves without specifying reference values for agecat? For example, how does SPSS do it?

Disclaimer: I agree that it makes little sense to assume that every individual in a sample has the sex 0.42, when the only occurring values are 0 and 1. Still, this assumption is common in other statistical software, and it is useful to be able to replicate models resulting from such assumptions in R. I have googled for a couple of hours about the matter, but could not find an answer about how to deal with categorical variables with >2 levels.

Edit:
I want to clarify based on the first comment. I do not want 4*4=16 survival curves like in this example:

mfit.refs2 <- data.frame(
"agecat"=c(rep('(34,55]', 8),rep('(55,64]',8),rep('(64,72]', 8),rep('(72,90]', 8)),   
"albcat"=rep(c('(1.8,3]','(3,3.2]','(3.2,3.5]','(3.5,5.1]'), 4))
mfit2 <- coxph(Surv(futime, death) ~ agecat + albcat, data=mgustest)
plot(survfit(mfit2, newdata=mfit.refs2), col=c("black","red","blue","green"))

Best Answer

I found a solution that produces survival curves equivalent to the standard survival curves in SPSS (judged visually). In short, I had to pre-create the dummy variables from agecat and insert the dummy variables into the coxph() function instead of agecat. After that, you specify proportions for each dummy variable in the "newdata" data frame, like I did with sex in the example.

A solution that avoids the pre-creation step would be great if it is possible, but I was not able do it like that. Any refinements or alternative solutions would be appreciated.

Here is the code:

#create three dummy variables from agecat
out <- contrasts(mgustest$agecat)[as.numeric(mgustest$agecat),]
colnames(out) <- paste("agecat",colnames(out),sep="")
out <- apply(out,2, as.logical)
mgustest <- data.frame(mgustest, out)

names(mgustest) #variable names to use in mfit3
prop.table(table(mgustest$agecat)) #proportions to use as reference values

#coxph with three dummy variables instead of "agecat"
mfit3 <- coxph(Surv(futime, death) ~ agecat.55.64. + agecat.64.72. + agecat.72.90. + albcat, 
               data=mgustest)
#newdata data frame specifying proportion for each of the dummy variables
mfit.refs3 <- data.frame("agecat.55.64."=prop.table(table(mgustest$agecat))[[2]],
                            "agecat.64.72."=prop.table(table(mgustest$agecat))[[3]],
                        "agecat.72.90."=prop.table(table(mgustest$agecat))[[4]],
                        "albcat"=c('(1.8,3]','(3,3.2]','(3.2,3.5]','(3.5,5.1]'))
#final plot
plot(survfit(mfit3, newdata=mfit.refs3), col=c("black","red","blue","green"))