Solved – Dealing with factors in a Cox model using coxph

categorical data, cox-model, interpretation, survival

I am trying to analyze a dataset with survival data. I am new to the Cox model and I am not sure how to interpret covariates that are factors. I have read the survival R package documentation and online examples, but I am still very confused.

The covariate I am trying to understand is frailty, which has three levels: frail, pre-frail, non-frail. I am using:

coxph(Surv(time,status)~ frailty, data=data)
# Call:
# coxph(formula = Surv(time, status) ~ frailty, data = data)
# 
#                    coef exp(coef) se(coef)     z     p
# frailtynon-frail -1.749     0.174    0.443 -3.95 8e-05
# frailtypre-frail -0.415     0.661    0.275 -1.51  0.13
# 
# Likelihood ratio test=21  on 2 df, p=2.78e-05
# n= 151, number of events= 70 
  1. It doesn't give me a line for the level frail. Is there a way to get it?
  2. The exp(coef) and p-value of the level non-frail are low. Does that mean that going from non-frail to either of the other two levels (frail, pre-frail) gives a significant decrease in survival?
  3. The p-value for the level pre-frail is not significant. If it is computed using frail and non-frail, is there a way to compute it without using non-frail?

Best Answer

Between the question and your comment there are two questions here: the comparisons that go into the displayed p-values, and how to interpret the coefficients in Cox regression.

The default in R, at least, is to present all regression results (linear, Cox, generalized linear, etc.) for the levels of a categorical variable with respect to its reference level. This can lead to confusion when statistical packages differ in their choice of reference level, as seen in this question. You obviously can't get a comparison of the reference level against itself, so in general you will have one fewer coefficient than the variable has levels. Here frail is the reference level (R orders factor levels alphabetically by default), which is why it has no line of its own.
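To make the role of the reference level concrete, here is a minimal sketch that refits the model after changing the reference with relevel(). The data frame dat is simulated purely for illustration (the event rates and the variable name frailty_nf are assumptions, not the asker's data), so the numbers it produces mean nothing; only the pattern of coefficient names matters.

```r
library(survival)

# Simulated stand-in for the asker's data (hypothetical, not the real dataset).
set.seed(1)
n <- 150
frailty <- factor(sample(c("frail", "pre-frail", "non-frail"), n, replace = TRUE))
# Assumed event rates per group, chosen only so that the groups differ.
rates <- c("frail" = 0.10, "pre-frail" = 0.06, "non-frail" = 0.02)
dat <- data.frame(
  time    = rexp(n, rate = unname(rates[as.character(frailty)])),
  status  = rbinom(n, 1, 0.8),   # 1 = event, 0 = censored
  frailty = frailty
)

# The first level is the reference; "frail" sorts first by default.
levels(dat$frailty)

# Default fit: both coefficients are comparisons against "frail".
fit_frail <- coxph(Surv(time, status) ~ frailty, data = dat)

# Same model with "non-frail" as the reference level instead.
dat$frailty_nf <- relevel(dat$frailty, ref = "non-frail")
fit_nonfrail <- coxph(Surv(time, status) ~ frailty_nf, data = dat)

names(coef(fit_frail))     # lines for non-frail and pre-frail
names(coef(fit_nonfrail))  # lines for frail and pre-frail
```

The two fits describe the same model. The frail line in the second fit is simply the non-frail line of the first fit with its sign flipped, so releveling changes which comparisons are displayed, not the fit itself.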

In semi-parametric Cox regression with results presented this way, the reference survival curve is based on the reference levels of categorical variables and values of 0 for continuous variables. This reference survival curve is the logical equivalent of the intercept in linear regression. The regression coefficients (shown as coef in the output) are changes in log-hazard relative to that baseline; the exp(coef) for a variable is thus its hazard ratio relative to baseline. So the interpretation of the coefficient for non-frail in your comment is not correct. The hazard ratio for non-frail versus frail is the value shown for exp(coef), 0.174: the hazard for non-frail is only 17.4% of the hazard for frail.
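The arithmetic behind that reading can be checked directly from the printed output. This sketch uses only the coef and se(coef) values shown in the question; the variable names are illustrative, and the Wald interval is the usual normal approximation rather than anything the original output reports.

```r
# Values copied from the printed coxph output for the non-frail line.
coef_nonfrail <- -1.749
se_nonfrail   <- 0.443

# Hazard ratio of non-frail relative to the reference level (frail).
hr_nonfrail_vs_frail <- exp(coef_nonfrail)

# Flipping the sign of the coefficient inverts the comparison:
# hazard ratio of frail relative to non-frail.
hr_frail_vs_nonfrail <- exp(-coef_nonfrail)

# Approximate 95% Wald interval, built on the log-hazard scale
# and then exponentiated.
ci_nonfrail_vs_frail <- exp(coef_nonfrail + c(-1, 1) * qnorm(0.975) * se_nonfrail)

round(hr_nonfrail_vs_frail, 3)   # matches the exp(coef) column, 0.174
round(hr_frail_vs_nonfrail, 2)
round(ci_nonfrail_vs_frail, 3)
```

Read the other way round, the frail group has roughly 5.75 times the hazard of the non-frail group, which is the same result stated from the opposite side.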

If you are interested in testing other combinations of predictors, you can define different contrasts. It can be a bit tricky to get started with this; you do have to think carefully about what comparisons you wish to make. This UCLA page provides one introduction, taking advantage of the glht() function in the multcomp package to minimize the amount of hand coding that might otherwise be involved. Also take a look at this Cross Validated page. Practice helps. With Cox models, the linear combinations of factor levels for other contrasts and statistical tests are taken on the coefficients themselves, not on the hazard ratios. In a Cox model, coefficients add; hazard ratios multiply.
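As a hand-coded illustration of such a contrast, the sketch below tests pre-frail against non-frail from a model whose reference level is frail. The data are simulated for illustration only (the rates and names are assumptions, not the asker's data), and the test is a plain Wald test built from coef() and vcov(); glht() automates the same kind of calculation.

```r
library(survival)

# Simulated stand-in for the asker's data (hypothetical, not the real dataset).
set.seed(1)
n <- 150
frailty <- factor(sample(c("frail", "pre-frail", "non-frail"), n, replace = TRUE))
rates <- c("frail" = 0.10, "pre-frail" = 0.06, "non-frail" = 0.02)  # assumed rates
dat <- data.frame(
  time    = rexp(n, rate = unname(rates[as.character(frailty)])),
  status  = rbinom(n, 1, 0.8),   # 1 = event, 0 = censored
  frailty = frailty
)

fit <- coxph(Surv(time, status) ~ frailty, data = dat)
b <- coef(fit)   # order: non-frail vs frail, then pre-frail vs frail
V <- vcov(fit)   # covariance matrix of the two coefficients

# Contrast weights: (pre-frail vs frail) minus (non-frail vs frail),
# which is the log hazard ratio of pre-frail vs non-frail.
k <- c(-1, 1)

est <- sum(k * b)                       # contrast on the coefficient scale
se  <- sqrt(drop(t(k) %*% V %*% k))     # its standard error
z   <- est / se
p   <- 2 * pnorm(-abs(z))               # two-sided Wald p-value

c(logHR = est, HR = exp(est), se = se, z = z, p = p)
```

The subtraction happens on the coefficients, and only the final result is exponentiated; equivalently, the hazard ratio for pre-frail versus non-frail is the ratio of the two displayed hazard ratios. The standard error needs the covariance between the two coefficients, which the printed output does not show, so it cannot be recovered from the se(coef) column alone.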

One final warning: there is a technical meaning of "frailty" in survival analysis that might lead to further confusion. See this answer for one introduction to its extended meaning as a within-group correlation of survival rather than simply an individual's health status.