Cox Model – Why P-Values Are Often Higher in a Cox Proportional Hazards Model Than in Logistic Regression

cox-model, logistic, r, statistical-power, survival

I've been learning about the Cox proportional hazards model. I have a lot of experience fitting logistic regression models, so to build intuition I've been comparing models fit using coxph from the R "survival" package with logistic regression models fit using glm with family="binomial".

If I run the code:

library(survival)
s = Surv(time=lung$time, event=lung$status - 1)
summary(coxph(s ~ age, data=lung))
summary(glm(status-1 ~ age, data=lung, family="binomial"))

I get p-values for age of 0.0419 and 0.0254, respectively. The same pattern holds if I use sex as a predictor, with or without age.

I find this puzzling because I would think that taking the amount of elapsed time into account when fitting the model would give more statistical power than just treating death as a binary outcome, whereas these p-values seem consistent with the Cox model having less statistical power. What is going on here?

Best Answer

The logistic regression model assumes the response is a Bernoulli trial (or more generally binomial, but for simplicity we'll keep it 0-1). A survival model typically assumes the response is a time to event (again, there are generalizations that we'll skip). Another way to put that is that the units are moving through time until the event occurs. It isn't that a coin is actually flipped discretely at each point. (That could happen, of course, but then you would need a model for repeated measures, perhaps a GLMM.)

Your logistic regression model treats each death as a coin flip that occurred at that patient's age and came up tails. Likewise, it treats each censored observation as a single coin flip that occurred at that patient's age and came up heads. The problem is that this is inconsistent with what the data really are.
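To make this concrete, here is a rough sketch of the two likelihood contributions in standard notation. The logistic regression sees only the binary outcome $y_i$, while the Cox partial likelihood compares each subject who dies with everyone still at risk at that time:

$$L_{\text{logistic}}(\beta) = \prod_i p_i^{\,y_i}(1-p_i)^{1-y_i}, \qquad \operatorname{logit}(p_i) = \beta_0 + \beta_1\,\mathrm{age}_i$$

$$L_{\text{Cox}}(\beta) = \prod_{i:\ \text{death at } t_i} \frac{\exp(\beta\,\mathrm{age}_i)}{\sum_{j \in R(t_i)} \exp(\beta\,\mathrm{age}_j)}$$

where $R(t_i)$ is the set of subjects still at risk just before $t_i$. A death on day 11 and a death on day 883 contribute identical terms to the first product, but very different terms to the second.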

Here are some plots of the data, and the output of the models. (Note that I flip the predictions from the logistic regression model to predicting being alive so that the line matches the conditional density plot.)

library(survival)
data(lung)
s = with(lung, Surv(time=time, event=status-1))
summary(sm <- coxph(s~age, data=lung))
# Call:
# coxph(formula = s ~ age, data = lung)
# 
#   n= 228, number of events= 165 
# 
#         coef exp(coef) se(coef)     z Pr(>|z|)  
# age 0.018720  1.018897 0.009199 2.035   0.0419 *
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
#     exp(coef) exp(-coef) lower .95 upper .95
# age     1.019     0.9815     1.001     1.037
# 
# Concordance= 0.55  (se = 0.026 )
# Rsquare= 0.018   (max possible= 0.999 )
# Likelihood ratio test= 4.24  on 1 df,   p=0.03946
# Wald test            = 4.14  on 1 df,   p=0.04185
# Score (logrank) test = 4.15  on 1 df,   p=0.04154
lung$died = factor(ifelse(lung$status==2, "died", "alive"), levels=c("died","alive"))
summary(lrm <- glm(status-1~age, data=lung, family="binomial"))
# Call:
# glm(formula = status - 1 ~ age, family = "binomial", data = lung)
# 
# Deviance Residuals: 
#     Min       1Q   Median       3Q      Max  
# -1.8543  -1.3109   0.7169   0.8272   1.1097  
# 
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)  
# (Intercept) -1.30949    1.01743  -1.287   0.1981  
# age          0.03677    0.01645   2.235   0.0254 *
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 268.78  on 227  degrees of freedom
# Residual deviance: 263.71  on 226  degrees of freedom
# AIC: 267.71
# 
# Number of Fisher Scoring iterations: 4
windows()  # opens a new graphics device on Windows; use dev.new() on other platforms
  plot(survfit(s~1))  # Kaplan-Meier curve for the whole sample
windows()
  par(mfrow=c(2,1))
  with(lung, spineplot(age, as.factor(status)))
  with(lung, cdplot(age, as.factor(status)))
  # overlay predicted probability of being alive from the logistic regression (flipped, see note above)
  lines(40:80, 1-predict(lrm, newdata=data.frame(age=40:80), type="response"),
        col="red")

Kaplan-Meier plot

Spineplot & conditional density plot w/ predicted probabilities from logistic regression
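Note that the Cox fit also gives age-specific predicted survival curves over the whole follow-up period, which the logistic fit cannot provide. A minimal sketch (the ages 50 and 70 are arbitrary illustrative choices; this assumes s and sm from the code above are still in the workspace):

# predicted survival curves from the Cox model at two illustrative ages
plot(survfit(sm, newdata=data.frame(age=c(50, 70))), col=c("blue", "red"),
     xlab="Days", ylab="Predicted survival probability")
legend("topright", legend=c("age 50", "age 70"), lty=1, col=c("blue", "red"))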


It may be helpful to consider a situation in which the data would be appropriate for either a survival analysis or a logistic regression. Imagine a study to determine the probability that a patient will be readmitted to the hospital within 30 days of discharge under a new protocol versus the standard of care. However, all patients are followed until readmission and there is no censoring (this isn't terribly realistic), so the exact time to readmission can be analyzed with survival analysis (viz., a Cox proportional hazards model here). To simulate this situation, I'll use exponential distributions with rates .5 and 1, and use the value 1 as a cutoff to represent 30 days:

set.seed(0775)  # this makes the example exactly reproducible
t1 = rexp(50, rate=.5)
t2 = rexp(50, rate=1)
d  = data.frame(time=c(t1,t2), 
                group=rep(c("g1","g2"), each=50), 
                event=ifelse(c(t1,t2)<1, "yes", "no"))
windows()  # opens a new graphics device on Windows; use dev.new() on other platforms
  plot(with(d, survfit(Surv(time)~group)), col=1:2, mark.time=TRUE)
  legend("topright", legend=c("Group 1", "Group 2"), lty=1, col=1:2)
  abline(v=1, col="gray")  # the cutoff representing 30 days

with(d, table(event, group))
#      group
# event g1 g2
#   no  29 22
#   yes 21 28
summary(glm(event~group, d, family=binomial))$coefficients
#               Estimate Std. Error   z value  Pr(>|z|)
# (Intercept) -0.3227734  0.2865341 -1.126475 0.2599647
# groupg2      0.5639354  0.4040676  1.395646 0.1628210
summary(coxph(Surv(time)~group, d))$coefficients
#              coef exp(coef)  se(coef)        z    Pr(>|z|)
# groupg2 0.5841386  1.793445 0.2093571 2.790154 0.005268299

Kaplan-Meier plot of the two simulated groups, with the cutoff at t = 1 marked
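As a side check, the true 30-day event probabilities implied by those exponential rates can be computed directly; allowing for sampling variability, they line up with the observed proportions in the table above (21/50 and 28/50):

pexp(1, rate=.5)  # [1] 0.3934693  (true event probability, group 1)
pexp(1, rate=1)   # [1] 0.6321206  (true event probability, group 2)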

In this case, we see that the p-value from the logistic regression model (0.163) was higher than the p-value from the survival analysis (0.005). To explore this idea further, we can extend the simulation to estimate the power of a logistic regression analysis vs. a survival analysis, and the probability that the p-value from the Cox model will be lower than the p-value from the logistic regression. I'll also use 1.4 as the threshold (the cutoff at which the two groups' event probabilities differ the most, computed below), so that I don't disadvantage the logistic regression by using a suboptimal cutoff:

xs = seq(.1, 5, .1)
xs[which.max(pexp(xs,1)-pexp(xs,.5))]  # cutoff maximizing the difference in event probabilities: 1.4

set.seed(7458)
plr = vector(length=10000)  # p-values from the logistic regressions
psv = vector(length=10000)  # p-values from the Cox models
for(i in 1:10000){
  t1 = rexp(50, rate=.5)
  t2 = rexp(50, rate=1)
  d  = data.frame(time=c(t1,t2), group=rep(c("g1", "g2"), each=50), 
                  event=ifelse(c(t1,t2)<1.4, "yes", "no"))
  plr[i] = summary(glm(event~group, d, family=binomial))$coefficients[2,4]
  psv[i] = summary(coxph(Surv(time)~group, d))$coefficients[1,5]
}
## estimated power:
mean(plr<.05)  # [1] 0.753
mean(psv<.05)  # [1] 0.9253
## probability that p-value from survival analysis < logistic regression:
mean(psv<plr)  # [1] 0.8977

So the power of the logistic regression is lower (about 75%) than that of the survival analysis (about 93%), and about 90% of the p-values from the survival analysis were lower than the corresponding p-values from the logistic regression. Taking the times into account, instead of just whether they fall above or below some threshold, does yield more statistical power, as you had intuited.
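With 10,000 simulated datasets, the Monte Carlo uncertainty in those power estimates is small; a quick back-of-the-envelope check using the usual binomial standard error:

sqrt(.753  * (1-.753 ) / 10000)  # ~0.0043, SE of the logistic power estimate
sqrt(.9253 * (1-.9253) / 10000)  # ~0.0026, SE of the Cox power estimate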
