Cox Model – Difference Between Z-Value and Wald Statistic in Cox Model’s Summary Function

cox-model, r, survival, wald test

When I run the code posted at the bottom (using the summary() function of the R survival package), I get the output shown immediately below:

Some sources (http://www.sthda.com/english/wiki/cox-proportional-hazards-model) state that the z-value is the “Wald statistic value” and continue: “It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta (β) coefficient of a given variable is statistically significantly different from 0. From the output above, we can conclude that the variable sex have highly statistically significant coefficients.”

On the other hand, other sources state “The z-value is a standardized score that measures the number of standard deviations a parameter estimate is from its null hypothesis value. It is calculated by dividing the estimated coefficient by its standard error. The z-value is used to calculate p-values and to assess the statistical significance of the coefficient. The Wald statistic, on the other hand, is a measure of the overall significance of a variable in the Cox proportional hazards model. It is calculated by dividing the squared coefficient estimate by its estimated variance. The Wald statistic is used to test the null hypothesis that the coefficient of a variable is equal to zero, which indicates that the variable is not a significant predictor of the outcome. In summary, the z-value is used to assess the statistical significance of individual coefficients, while the Wald statistic is used to test the overall significance of a variable in the Cox proportional hazards model. Both are important measures in assessing the validity and usefulness of a Cox proportional hazards model, but they serve different purposes.”

Which, if either, description of the z-value and Wald statistic is correct?

Code:

library(survival)
library(survminer)

head(lung)

res.cox <- coxph(Surv(time, status) ~ sex, data = lung)
summary(res.cox)

Best Answer

Both are correct, if there's just a single coefficient involved.

"[D]ividing the squared coefficient estimate by its estimated variance" gives a statistic evaluated against a chi-square distribution with 1 degree of freedom. That's just the square of the z-statistic in your display, which is evaluated against a standard normal distribution. $(-3.176)^2=10.09$

As a chi-square distribution with 1 degree of freedom is the distribution of a squared standard normal, inference is identical regardless of your definition.
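The equivalence can be checked numerically. The following is a minimal sketch using the same model as in the question; it assumes that the fitted `coxph` object exposes its overall Wald chi-square as `wald.test` and that the coefficient table returned by `summary()` has a column named `"z"`. The variable names are only illustrative.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ sex, data = lung)

# z-statistic for sex, as reported in the coefficient table of summary()
z <- summary(fit)$coefficients["sex", "z"]

# Overall Wald chi-square stored on the fitted object
wald <- unname(fit$wald.test)

# p-values from the two reference distributions
p_z     <- 2 * pnorm(-abs(z))                        # two-sided, standard normal
p_chisq <- pchisq(z^2, df = 1, lower.tail = FALSE)   # chi-square, 1 df

print(c(z = z, z_squared = z^2, wald = wald))
print(c(p_from_z = p_z, p_from_chisq = p_chisq))
```

With a single coefficient, `z^2` and `wald` should agree up to floating-point error, and so should the two p-values.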

The overall Wald test in a model with multiple coefficients is a joint test of the hypothesis that all coefficients equal 0. With more than 1 coefficient, that can't be done with a z-test; the more general chi-square form is used, with an appropriate number of degrees of freedom.
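As a sketch of what the joint test computes, the quadratic form $\hat\beta^\top \hat V^{-1} \hat\beta$ can be evaluated by hand and compared with the value stored on the fit. The two-covariate model below (sex and age from the `lung` data) is only an illustration and is not part of the original question.

```r
library(survival)

fit2 <- coxph(Surv(time, status) ~ sex + age, data = lung)

b <- coef(fit2)   # estimated coefficients
V <- vcov(fit2)   # their estimated covariance matrix

# Joint Wald statistic: b' V^{-1} b, referred to a chi-square with length(b) df
wald_joint <- drop(t(b) %*% solve(V, b))
df_joint   <- length(b)
p_joint    <- pchisq(wald_joint, df = df_joint, lower.tail = FALSE)

print(c(by_hand = wald_joint, from_fit = unname(fit2$wald.test),
        df = df_joint, p = p_joint))
```

The hand-computed value should match `fit2$wald.test` up to numerical tolerance.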

A Wald test can also be used to evaluate subsets of coefficients, for example all those associated with a multi-level categorical predictor or for a predictor along with all of its interactions. Search this site for "chunk test" for more details.
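A hand-rolled sketch of such a subset test follows, here for age together with its interaction with sex. The model and the way the coefficients are selected (matching `"age"` in the coefficient names) are assumptions made for illustration; dedicated functions in other packages do this more conveniently.

```r
library(survival)

fit3 <- coxph(Surv(time, status) ~ sex * age, data = lung)

b <- coef(fit3)
V <- vcov(fit3)

# Coefficients involving age: the main effect and the sex:age interaction
idx <- grep("age", names(b))

# Wald statistic restricted to that subset, using the matching block of V
wald_chunk <- drop(t(b[idx]) %*% solve(V[idx, idx], b[idx]))
df_chunk   <- length(idx)
p_chunk    <- pchisq(wald_chunk, df = df_chunk, lower.tail = FALSE)

print(names(b)[idx])
print(c(wald = wald_chunk, df = df_chunk, p = p_chunk))
```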

Related Solutions

Cox Model – Extrapolating Effect of Covariable Changes in Cox Proportional Hazards Models

I would suggest you do it non-parametrically. The procedure as you describe it imposes assumptions on the way the failure functions can relate to each other, basically because the Cox model introduces the assumption of proportional hazards. Therefore, I would argue that the red and black curves in the plot are a visualization of the model, more than they are estimates of failure functions. Not that those two things couldn't coincide, but why make this further assumption?

If you want to do something similar but nonparametric, I would suggest using the Kaplan-Meier (KM) estimates instead. You would have to divide the weight variable into groups (assuming it's continuous), e.g. "low" and "high". You would still be able to do the counterfactual analysis that you want, simply by making a "conditional" KM plot similar to the green one above. The green curve would be the KM estimate of the "high" group until age $40$. At age $40$, the KM estimate of the "low" weight group (for ages above $40$) would continue, pasted onto the end of the "high" curve at $40$. The KM estimate is the estimated probability of reaching age $t$. Thus, for the hypothetical individual changing weight groups, we can think of the probability of reaching age $40 + s$ as the probability of living from $40$ to $40 + s$ in the low weight group, given survival until $40$, times the probability of living from $0$ to $40$ in the high weight group. This corresponds exactly to "pasting" the KM estimates together at age $40$. Note that the KM estimates themselves are products of conditional probabilities (conditional on survival until some time point). In symbols, if $X$ is a random variable describing the time of failure of this hypothetical individual:

$$ P(X > 40 + s) = P(X > 40 + s | X > 40)P(X > 40), \ s \geq 0. $$
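Written in terms of the two groups' KM curves, with $\hat S_{\text{high}}$ and $\hat S_{\text{low}}$ as notation introduced here for the estimated survival functions of the "high" and "low" groups, the pasted curve would be:

$$ \hat S_{\text{pasted}}(40 + s) = \frac{\hat S_{\text{low}}(40 + s)}{\hat S_{\text{low}}(40)} \, \hat S_{\text{high}}(40), \ s \geq 0. $$

The ratio is the estimate of $P(X > 40 + s | X > 40)$ taken from the "low" group, and $\hat S_{\text{high}}(40)$ is the estimate of $P(X > 40)$ taken from the "high" group.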

In conclusion, this amounts to the KM plot for "high" until age $40$ and at $40$ we use the conditional survival history of "low" (conditional on survival until $40$). To show it on a plot:

Figure: Conditional KM estimate of (highly) hypothetical subject

Some code to produce the plot, using built-in functions in R

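library(ggplot2)
library(survMisc)
library(survival)

# Simulated failure times (no censoring): X1 is the "high" group, X2 the "low" group
X1 <- rexp(n = 20)*50
X2 <- rexp(n = 20)*100

# KM for the "high" group, and KM for the "low" group among those surviving past 40
Sfit1 <- survfit(Surv(time = X1) ~ 1)
Sfit2 <- survfit(Surv(time = X2[X2 > 40]) ~ 1)

# Last time point and survival probability of the "high" curve before age 40
v  <- autoplot(Sfit1)$plot
p1 <- tail(v$data$surv[v$data$time < 40], 1)
t1 <- tail(v$data$time[v$data$time < 40], 1)

# Time points of the conditional "low" curve, starting where the "high" curve left off
u <- autoplot(Sfit2)$plot
x <- c(t1, as.vector(u$data$time)[-1])

# Pasted curve: survival of "high" just before 40 times conditional survival of "low"
Sdata <- data.frame(x = x, y = p1*as.vector(u$data$surv), st = "2")

autoplot(Sfit1, title=NULL)$plot + geom_step(data=Sdata, aes(x=x, y=y, st=st))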

However, one should probably still consider what the purpose of the plot really is. We're not really describing any of our subjects and it's not clear that we're describing a hypothetical (but plausible) subject either. You would want to remember that you're assuming that the hazard changes instantaneously, not only that the weight changes instantaneously. I'm no expert on human physiology, but a sudden weight loss probably entails other side-effects that are not appropriately modelled.

This is simulated data, but one should also keep in mind that the weight covariate is time-dependent, especially since we're also modelling young people and children. Treating it as time-independent is probably a bad idea. Also, the heavy people will be the ones who entered the study as adults, since weight is measured at entry. The OP seems to be aware of this, but I thought I'd mention it anyway.

Solved – How to validate Cox Proportional Hazards model

This is not what you do to validate an event time model. You need a smooth calibration curve at each of a series of time horizons plus validation of predictive discrimination, e.g., Somers' Dxy rank correlation (c-index). The R rms package makes this easy, and it can use the bootstrap to correct for overfitting if you are honest about including all candidate variables in the model. See my course notes for details: http://biostat.mc.vanderbilt.edu/rms
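A rough sketch of that workflow with the rms package is given below. The model (sex and age on the `lung` data), the 365-day horizon, the seed, and the number of bootstrap repetitions are all assumptions chosen for illustration, and the default smooth calibration method is assumed to require the polspline package to be installed.

```r
library(survival)
library(rms)

set.seed(1)  # bootstrap resampling is random

# x, y, surv and time.inc are stored so that validate() and calibrate() can reuse them
f <- cph(Surv(time, status) ~ sex + age, data = lung,
         x = TRUE, y = TRUE, surv = TRUE, time.inc = 365)

# Bootstrap optimism-corrected indexes, including Somers' Dxy
v <- validate(f, B = 100)
print(v)

# Bootstrap calibration curve at a 365-day horizon
cal <- calibrate(f, u = 365, B = 100)
plot(cal)
```

The corrected Dxy can be converted to a c-index as Dxy/2 + 0.5. For the overfitting correction to be honest, the fitted formula should contain every candidate variable that was considered, not only those that survived selection.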
