Cox Regression Outputs – How to Interpret Based on Data Type of Independent Categorical Variable

cox-model, hazard, r, regression, survival

I have the following data frame, which consists of a grouping variable as well as time and status variables for survival analysis:

    sample_df <- structure(list(group = c("Group C", "Group C", "Group B", "Group B", 
"Group C", "Group C", "Group B", "Group C", "Group C", "Group B", 
"Group B", "Group C", "Group B", "Group B", "Group C", "Group A", 
"Group C", "Group B", "Group C", "Group B", "Group B", "Group B", 
"Group A", "Group C", "Group B", "Group C", "Group C", "Group C", 
"Group B", "Group C", "Group C", "Group A", "Group B", "Group C", 
"Group C", "Group B", "Group B", "Group C", "Group B", "Group C", 
"Group C", "Group C", "Group C", "Group B", "Group C", "Group C", 
"Group C", "Group A", "Group C", "Group C"), status = c(0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L
), time = c(379L, 120L, 380L, 419L, 365L, 376L, 1499L, 727L, 
607L, 6L, 375L, 364L, 64L, 3L, 366L, 368L, 1523L, 57L, 104L, 
180L, 4L, 1111L, 852L, 433L, 2562L, 534L, 490L, 1475L, 1794L, 
7L, 744L, 754L, 1484L, 365L, 746L, 161L, 421L, 358L, 532L, 36L, 
368L, 523L, 2262L, 1618L, 247L, 83L, 365L, 448L, 1303L, 436L)), class = "data.frame", row.names = c(NA, 
-50L))

In its present form, the grouping variable is a character variable. When I run a Cox regression using the following code, I get this output:

library(survival)  # provides coxph() and Surv()
summary(coxph(Surv(time, status) ~ group, data = sample_df))

So I guess this gives me the hazard ratio and p-values of Groups B and C compared with Group A, respectively.

If I change the grouping variable to an unordered factor variable and re-run the Cox regression, I get the same result. So far so good. However, if I change the grouping variable to an ordered factor, I get the following output:

library(dplyr)  # provides %>%
sample_df$group_ord <- 
  sample_df$group %>% 
  factor(levels = c("Group A", "Group B", "Group C"),
         ordered = TRUE)

summary(coxph(Surv(time, status) ~ group_ord, data = sample_df))

I'm having trouble interpreting these results. What do the "L" and "Q" at the end of the grouping variable names stand for? And what do the HR and p-values refer to?

And finally, I tried using the predictor variable as a numerical variable (changing the categorical variable to a numerical one), like this:

sample_df <- 
  sample_df %>% 
  mutate(
    group_num = case_when(
      group == "Group A" ~ 0,
      group == "Group B" ~ 1,
      group == "Group C" ~ 2
    )
  )

Running Cox regression I get:

summary(coxph(Surv(time, status) ~ group_num, data = sample_df))

This gives me one HR and one p-value. Am I correct in interpreting this HR as the average HR increase per 1-unit increase in the independent variable?

I'm sorry for this messy post but I hope someone can help me correctly interpret these results.

Best Answer

By default, R codes an ordered factor with orthogonal polynomial contrasts (contr.poly), so an ordinal predictor is modeled with polynomials. That allows for a general shape of the association between the predictor and outcome. With a 3-level ordinal predictor there are two terms: "L" stands for the "linear" term and "Q" for the "quadratic" term. The hazard ratios and p-values are those associated with the linear and quadratic terms of the polynomial, respectively, not with comparisons of individual groups against Group A.
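
A rough sketch of what the polynomial coding does in practice. It uses the lung data that ships with the survival package as a stand-in (not the data from the question), restricted to three levels of ph.ecog; the names d, ecog_unord and ecog_ord are made up for this illustration.

```r
library(survival)

# Contrast matrix R uses by default for a 3-level ordered factor:
# column .L is the linear trend, column .Q the quadratic one.
cp <- contr.poly(3)
print(cp)

# Stand-in data (assumption: lung, ph.ecog levels 0, 1, 2 only).
d <- subset(lung, !is.na(ph.ecog) & ph.ecog < 3)
d$ecog_unord <- factor(d$ph.ecog)
d$ecog_ord   <- factor(d$ph.ecog, ordered = TRUE)

fit_unord <- coxph(Surv(time, status) ~ ecog_unord, data = d)
fit_ord   <- coxph(Surv(time, status) ~ ecog_ord,   data = d)

# Same model, different parameterisation: the fitted log-likelihoods agree.
print(c(unordered = fit_unord$loglik[2], ordered = fit_ord$loglik[2]))

# Linear predictor per level implied by the .L and .Q coefficients,
# then expressed as differences from the first level.
lp <- drop(cp %*% coef(fit_ord))
print(lp[-1] - lp[1])    # log hazard ratios vs. the first level
print(coef(fit_unord))   # treatment-contrast coefficients for comparison
```

The two parameterisations describe the same fit, so the group-versus-reference hazard ratios can still be recovered from the ".L" and ".Q" coefficients if that is the comparison of interest.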

If you use a numeric predictor as in your last example, the model assumes it has a linear association with outcome; in a Cox model, that means with the log-hazard. If that association with outcome is truly linear, then the HR is that for a 1-unit increase in the predictor. If the association isn't truly linear, however, I'd be reluctant to call that an "average HR increase."
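
One way to probe the linearity assumption, sketched here on the lung data from the survival package rather than on the question's data: fit the predictor once as numeric and once as a factor, then compare the nested models with a likelihood ratio test. The object names (d, fit_num, fit_fac) are hypothetical.

```r
library(survival)

# Stand-in data (assumption: lung, ph.ecog levels 0, 1, 2 only).
d <- subset(lung, !is.na(ph.ecog) & ph.ecog < 3)

# Numeric coding: one coefficient, linear in the log-hazard.
fit_num <- coxph(Surv(time, status) ~ ph.ecog, data = d)

# Factor coding: one coefficient per non-reference level, no linearity assumed.
fit_fac <- coxph(Surv(time, status) ~ factor(ph.ecog), data = d)

# Under linearity, the HR for a k-unit increase is the per-unit HR to the power k.
hr_per_unit  <- exp(coef(fit_num))
hr_two_units <- exp(2 * coef(fit_num))
print(c(per_unit = unname(hr_per_unit), two_units = unname(hr_two_units)))

# Likelihood ratio test of the linear model against the unrestricted factor model.
print(anova(fit_num, fit_fac))
```

A small p-value in the comparison would suggest that the single per-unit HR is hiding a non-linear pattern across the groups.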

Related Solutions

Solved – Get standard error of exponentiated coefficient in cox regression

You can do it manually with the delta method by calculating $se(sex) \cdot \exp(sex) = 0.1672 \cdot 0.5880 = 0.0983136$, since the derivative of $\exp$ is $\exp$ itself, or like this using svycontrast() in the survey package:

library("survival")
library("survey")
data("lung") #From the survival package
res.cox <- coxph(Surv(time, status) ~ sex, data = lung)
summary(res.cox)
svycontrast(res.cox, quote(exp(sex)))

which yields

         nlcon     SE
contrast 0.588 0.0983
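
The manual calculation can also be sketched in a few lines without the survey package. This assumes the same lung model as above; beta, se_beta and se_hr are names introduced here for illustration.

```r
library(survival)

res.cox <- coxph(Surv(time, status) ~ sex, data = lung)

# Coefficient (log hazard ratio) and its standard error.
beta    <- coef(res.cox)["sex"]
se_beta <- sqrt(vcov(res.cox)["sex", "sex"])

# Delta method: se(exp(beta)) is approximately exp(beta) * se(beta).
se_hr <- exp(beta) * se_beta
print(c(HR = unname(exp(beta)), SE = unname(se_hr)))

# Confidence limits are usually formed on the log scale and then exponentiated.
print(exp(confint(res.cox)))
```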

R – How to Perform a Likelihood Ratio Test on a New Dataset Using Cox Model

As best as I can tell from the linked publication, the models based on the training set were applied, with the same regression coefficients, to the data in the internal and external validation sets.

The problem with your proposed approach is that the coxph() function will try to re-fit models to pred1 and pred2 on the new data. You would get a new regression coefficient for each model, a slope for how each of pred1 and pred2 is associated with log hazard in the new data set. That's useful in some contexts; for example, it's how the validate() function in Harrell's rms package evaluates the optimism in a model. But that's not what you want.

The trick with likelihoods based on a model with pre-defined regression coefficients is to recognize that you just want to evaluate the (partial) likelihood of the data, given that model. AdamO shows, with code, a very simple way to do this with Cox models:

"If you simply want the partial likelihood, why not fool R into giving it to you? Simply initialize beta and allow no iterations, then extract the loglik value from the coxph object."

I don't know if that's precisely how the authors of this study did this, but that's the basic idea. An alternative might be to use offset(pred1) and offset(pred2) to force their regression coefficients to be exactly 1, then extract the log likelihoods.
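
Both ideas can be sketched as follows. This is only an illustration under assumptions: the lung data stand in for the study data, the odd/even split into train and test is arbitrary, and age and sex play the role of the predictors.

```r
library(survival)

# Stand-in data and an arbitrary split (assumptions for illustration only).
d     <- lung[complete.cases(lung[, c("time", "status", "age", "sex")]), ]
train <- d[seq(1, nrow(d), by = 2), ]
test  <- d[seq(2, nrow(d), by = 2), ]

fit_train <- coxph(Surv(time, status) ~ age + sex, data = train)

# Approach 1: start at the training coefficients and allow no iterations,
# so the reported log partial likelihood is evaluated at those fixed values.
fit_fixed <- coxph(Surv(time, status) ~ age + sex, data = test,
                   init = coef(fit_train),
                   control = coxph.control(iter.max = 0))
ll_fixed <- fit_fixed$loglik[2]

# Approach 2: put the training linear predictor in as an offset,
# which forces its coefficient to be exactly 1.
test$lp <- drop(as.matrix(test[, c("age", "sex")]) %*% coef(fit_train))
fit_off <- coxph(Surv(time, status) ~ offset(lp), data = test)
ll_off  <- fit_off$loglik[1]

print(c(fixed_init = unname(ll_fixed), offset = unname(ll_off)))
```

Neither call re-estimates the coefficients on the new data, so the log partial likelihoods of competing pre-specified models can be compared on the same validation set.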

For the C-index calculations, the authors say that they used tools in the compareC package. I don't have experience with that package. The basic R survival package has a concordance() function that can evaluate the C-index for a model applied to new data.
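
A minimal sketch of the concordance() route, again on the lung data with an arbitrary split standing in for the training and validation sets (the names train, test and fit_train are hypothetical):

```r
library(survival)

# Stand-in data and an arbitrary split (assumptions for illustration only).
d     <- lung[complete.cases(lung[, c("time", "status", "age", "sex")]), ]
train <- d[seq(1, nrow(d), by = 2), ]
test  <- d[seq(2, nrow(d), by = 2), ]

fit_train <- coxph(Surv(time, status) ~ age + sex, data = train)

# C-index of the training-set model, evaluated on the held-out data
# without refitting the coefficients.
c_test <- concordance(fit_train, newdata = test)
print(c_test)

# The point estimate is stored in the 'concordance' component.
print(c_test$concordance)
```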

A few warnings.

First, as the authors note, even Harrell doesn't find the C-index useful for comparisons among models. It's a good measure of discrimination for a single model.

Second, a split into training and validation sets isn't the most efficient use of data when there are so few cases. See this post, for example. Resampling from a combined data set is generally better unless there are tens of thousands of cases.

Third, the authors' use of LASSO might have found a set of predictors that work OK, but the same procedure applied to a new data set might well find a different set of predictors. It's not clear that the authors' use of bootstrapping evaluated the entire model-building process including the LASSO predictor selection. That's another advantage of a full resampling-based validation of the model-building process in this situation.
