After fitting a Cox model it is possible to make predictions and retrieve the relative risk of new data. What I don't understand is how the relative risk is computed for an individual, and what it is relative to (e.g. the average of the population). Any recommendations for resources to help me understand this (I'm not very advanced in survival analysis, so the simpler the better)?
Solved – How to interpret the output of predict.coxph
cox-model, predictive-models, relative-risk
Related Solutions
Interpretation of the graph in your case
Note: The y-axis is not always the relative risk as in the example given in the vignette of the `dlnm` package. This is only the case in their example because they used mortality data and Poisson regression models. In their framework, the exponentiated regression coefficient from the Poisson models, $RR=\exp(\hat{\beta})$, is the relative risk. This is analogous to exponentiating the regression coefficients in logistic regression, which yields the odds ratio.
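As a quick numeric illustration of that exponentiation (using a made-up coefficient, not one from the vignette):

```python
import math

beta_hat = 0.095             # hypothetical Poisson regression coefficient
rr = math.exp(beta_hat)      # relative risk, approximately 1.0997
excess_pct = (rr - 1) * 100  # roughly a 10% higher event rate per unit of exposure
```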
Can I still use the model?
Yeah, you can still use such a model.
Let's summarize what you do:
- You use natural cubic B-splines as basis functions instead of polynomials to model the relationship between temperature and $\mathrm{CO}_{2}$ (`arglag` with option `type="ns"` instead of `type="poly"`).
- You assume that the effect of temperature is non-linear, as you specify `argvar` as splines. One important thing you have to know for the interpretation of the plots is that the function `crossbasis` automatically centers the values at the predictor mean (i.e. the mean temperature) if not specified otherwise. This is the reference value with which the predictions are later compared in the graphics.
- You consider lags up to 12 (option `lag=12` in `crossbasis`). (By the way: why do you suppress the warnings?)
- You calculate a GLM with Gaussian errors and the identity link function, which is equivalent to a simple linear regression (OLS). You could have used the `lm` function instead.
The plot that you have provided is interpreted as follows: The x-axis is the lag.
Interpretation of the values of the y-axis: The y-axis depicts the change in $\mathrm{CO}_{2}$ concentration associated with an increase of 10, 20, or 30°C compared to the mean temperature. If the predicted change is 0, an increase in temperature is not associated with a change in $\mathrm{CO}_{2}$ concentration compared to the concentration at the mean temperature: the predicted $\mathrm{CO}_{2}$ concentration is the same at $\bar{x}_{Temp}+z$ degrees (where $z$ is any amount, say 10 or 20 degrees) as at the mean temperature $\bar{x}_{Temp}$.
This means that for an increase in temperature of 10°C, the temperature at lag 0 (i.e. in the same hour) increases the $\mathrm{CO}_{2}$ concentration compared to the mean temperature. Because you specified `cumul=TRUE` in `crosspred`, the effects are cumulative. The cumulative effects of an increase of 10°C are practically nonexistent after 4 hours compared to the mean temperature. This suggests that the non-cumulative effects are negative at lags 1-4 and null from then on.
For temperature increases of 20 or 30°C, the cumulative effects on the $\mathrm{CO}_{2}$ concentration are lower in the first 1-4 hours compared to $\mathrm{CO}_{2}$ at the mean temperature. As with temperature increases of 10°C, the cumulative effects are practically nonexistent after 4 or 5 hours. Again: $\mathrm{CO}_{2}$ concentrations are the same at mean temperature and at an increase in temperature of 20 or 30°C after 4 or 5 hours.
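The cumulative curve described above is just the running sum of the lag-specific effects. A toy sketch with invented numbers (not your data) shows how negative effects at lags 1-4 cancel the initial positive effect:

```python
# Hypothetical lag-specific changes in CO2 for a +10°C step (vs. the mean temperature)
lag_effects = [0.8, -0.3, -0.3, -0.15, -0.05, 0.0, 0.0]

cumulative, running = [], 0.0
for effect in lag_effects:
    running += effect
    cumulative.append(round(running, 2))

# positive at lag 0, shrinking toward ~0 after lag 4
```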
I think a 3-D plot would be easier to interpret. Try the following code:
plot(cp, xlab="Temperature", col="red", zlab="CO2", shade=0.6,
     main="3D graph of temperature effect")
Interpretation of the example given in the vignette of the `dlnm` package
First, a little something about distributed lag models. They have the form:
$$ Y_{t}=\alpha + \sum_{l=0}^{K}\beta_{l}x_{t-l} + \text{other predictors} +\epsilon_{t} $$ where $K$ is the maximum lag and $x$ is a predictor. This is fitted using a multiple linear regression. So the coefficient $\beta_{1}$ estimates the effect of the previous day's value $x_{t-1}$ on $Y_{t}$. In essence, multiple lags of the predictor are included in the model simultaneously. This obviously has the problem that the lagged predictors are highly correlated (autocorrelation).
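In code, fitting such a model just means regressing $Y_t$ on the current and lagged values of $x$. A minimal sketch of building the lagged design matrix (with invented data) also makes the correlation problem visible:

```python
# Invented predictor series and maximum lag
x = [3.0, 2.5, 4.1, 3.8, 2.9, 3.3, 4.0, 3.6]
K = 2

# Row t holds (x_t, x_{t-1}, ..., x_{t-K}); regress Y_t on these columns.
design = [[x[t - l] for l in range(K + 1)] for t in range(K, len(x))]
# Adjacent columns are shifted copies of each other, hence highly correlated.
```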
A more advanced method is the polynomial distributed lag model. It has the same basic formula as above, but the impulse-response function is forced to lie on a polynomial of degree $q$ (link to a paper for Stata):
$$ \beta_{i} = a_{0} + a_{1}i + a_{2}i^2 +\ldots+a_{q}i^q $$ where $q$ is the degree of the polynomial and $i$ the lag length. Another formulation is $$ \beta_{i} = a_{0} + \sum_{j=1}^{q}a_{j}f_{j}(i) $$
Where $f_{j}(i)$ is a polynomial of degree $j$ in the lag length $i$. A good introduction to the `dlnm` package and polynomial distributed lag models can be found here.
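The constraint can be sketched directly: each lag coefficient $\beta_{i}$ is evaluated from a low-degree polynomial in the lag $i$, so only $q+1$ free parameters generate all $K+1$ lag coefficients (the $a_{j}$ values below are made up):

```python
# Hypothetical polynomial coefficients a_0..a_q with q = 2
a = [0.5, -0.20, 0.01]
K = 6  # maximum lag

# beta_i = a_0 + a_1*i + a_2*i^2 for each lag i
betas = [sum(a_j * i**j for j, a_j in enumerate(a)) for i in range(K + 1)]
```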
These models are often used in studies about air-pollution and health because air-pollution has lagged effects on health outcomes.
Let's look at this graph from the vignette of the `dlnm` package (page 13):
The degree of the polynomial was $q=4$ in this case, so the green line is a polynomial of 4th degree. The y-axis is the relative risk (RR) estimated via Poisson regression and the x-axis the considered lag. The relative risk has the following interpretation: persons who were exposed have a $(RR-1)\cdot100\%$ higher/lower chance of getting the outcome (e.g. death, lung cancer, etc.) compared to people who were not exposed. $RR>1$ means a positive association, $RR<1$ a protective association, and $RR=1$ no association. We see that for every increase of $\textrm{PM}_{10}$ by 10 units ($\mu \mathrm{g}/m^{3}$), there is a $(1.001-1)\cdot100\%=0.1\%$ increase in the risk of dying at lag 0 (i.e. on the same day as the exposure). Strangely, the exposure from about 9 days ago is protective: an increase of $10~\mu \mathrm{g}/m^{3}$ is associated with a decreased risk of dying compared to people with 10 units less exposure. We can also see that the exposure from 15 days before doesn't play a role (i.e. $RR\approx1$).
Let's look at the cumulative relative risk:
This is the same as before, but the effects are cumulated over time (i.e. summing all contributions from the lags up to the maximum lag). The red line starts at the same point as the green line in the first graphic (i.e. $\approx1.001$). We can see that people who have been exposed for five days have an increased cumulative risk of about $(1.005-1)\cdot100\%=0.5\%$ of dying compared to non-exposed people. Because the green line goes below the relative risk of $1$ after a lag of about 5 days, the cumulative association after 15 days is nearly $1$. This means that the protective effects of $\textrm{PM}_{10}$ from lag 5 on have compensated the harmful effects from earlier lags. Whether that is scientifically reasonable is another question.
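On the log scale the lag contributions add, so the cumulative RR is the product of the lag-specific RRs. A toy sketch with invented values (not the vignette's estimates) shows how harmful and protective lags cancel:

```python
import math

# Hypothetical lag-specific relative risks: harmful early, protective later
lag_rr = [1.0010, 1.0008, 1.0005, 1.0002, 1.0000, 0.9995, 0.9990, 0.9985]

# Cumulative RR over all lags = product of the lag-specific RRs
cumulative_rr = math.prod(lag_rr)
# the contributions roughly cancel, leaving a cumulative RR close to 1
```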
👋 Hi, author of lifelines here. What you're asking is possible. The Kaplan-Meier curve gives you a distribution of possible durations, where duration is the time between birth and death. However, given that a player has existed for $N$ months, you can condition the survival function on $T > N$ to get a better estimate.
Let $S(t) = P(T \ge t)$ be the survival function. We are curious about $S(t | T \ge N)$.
$$ S(t | T\ge N) = \frac{P(T \ge t \text{ and } T \ge N)}{P(T \ge N)} = \frac{P(T \ge t)}{S(N)} = \frac{S(t)}{S(N)},\;\; t \ge N $$
So we simply need to divide the survival function by itself evaluated at the duration seen thus far.
In your use case, you could do something like:
predictions = pd.DataFrame(index=kmf.survival_function_.index)
for ix, row in alive_individuals.iterrows():
    T = row['T']
    predictions[ix] = kmf.survival_function_ / kmf.survival_function_.loc[T]

# can't have probabilities greater than 1
predictions[predictions > 1.0] = 1.0
This gives you the new survival functions. However, lifelines has another utility you can use: `kmf.conditional_time_to_event_` computes these conditional survival functions and then takes the median time remaining. Output using some fake data I have:
KM_estimate - Conditional time remaining to event
timeline
0.0 56.0
6.0 50.0
7.0 49.0
9.0 47.0
13.0 43.0
15.0 41.0
17.0 39.0
19.0 37.0
22.0 34.0
26.0 32.0
29.0 29.0
32.0 26.0
33.0 27.0
36.0 24.0
38.0 22.0
41.0 19.0
43.0 17.0
45.0 15.0
47.0 13.0
48.0 13.0
51.0 10.0
53.0 8.0
54.0 7.0
56.0 7.0
58.0 5.0
60.0 8.0
61.0 7.0
62.0 6.0
63.0 6.0
66.0 3.0
68.0 1.0
69.0 6.0
75.0 0.0
So if a player lives until age 62, we expect 6 more months left (6 months being the median time to death, given the player lived to 62). That may help as well.
Best Answer
Edit: the following description applies to `survival` versions 3.2-8 and below. Starting with version 3.2-9, the default behavior of `predict.coxph()` changes with respect to treating 0/1 (dummy indicator) variables. See the package NEWS.

`predict.coxph()` computes the hazard ratio relative to the sample average for all $p$ predictor variables. Factors are converted to dummy predictors as usual, whose average can be calculated. Recall that the Cox PH model is a linear model for the log-hazard $\ln h(t)$:
$$ \ln h(t) = \ln h_{0}(t) + \beta_{1} X_{1} + \dots + \beta_{p} X_{p} = \ln h_{0}(t) + \bf{X} \bf{\beta} $$
Where $h_{0}(t)$ is the unspecified baseline hazard. Equivalently, the hazard $h(t)$ is modeled as $h(t) = h_{0}(t) \cdot e^{\beta_{1} X_{1} + \dots + \beta_{p} X_{p}} = h_{0}(t) \cdot e^{\bf{X} \bf{\beta}}$. The hazard ratio between two persons $i$ and $i'$ with predictor values $\bf{X}_{i}$ and $\bf{X}_{i'}$ is thus independent of the baseline hazard and independent of time $t$:
$$ \frac{h_{i}(t)}{h_{i'}(t)} = \frac{h_{0}(t) \cdot e^{\bf{X}_{i} \bf{\beta}}}{h_{0}(t) \cdot e^{\bf{X}_{i'} \bf{\beta}}} = \frac{e^{\bf{X}_{i} \bf{\beta}}}{e^{\bf{X}_{i'} \bf{\beta}}} $$
For the estimated hazard ratio between persons $i$ and $i'$, we just plug in the coefficient estimates $b_{1}, \ldots, b_{p}$ for the $\beta_{1}, \ldots, \beta_{p}$, giving $e^{\bf{X}_{i} \bf{b}}$ and $e^{\bf{X}_{i'} \bf{b}}$.
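A small numeric sketch of that ratio (coefficients and covariate values invented, not estimates from any fitted model):

```python
import math

b = [-0.38, -0.06, 0.10]  # hypothetical estimates b_1..b_p
x_i = [1, 27, 3]          # person i
x_j = [0, 22, 1]          # person i'

def lin_pred(x):
    """Linear predictor X b."""
    return sum(bk * xk for bk, xk in zip(b, x))

# h_i(t) / h_i'(t) = exp(X_i b) / exp(X_i' b) = exp((X_i - X_i') b); t cancels out
hazard_ratio = math.exp(lin_pred(x_i) - lin_pred(x_j))
```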
As an example in R, I use the data from John Fox's appendix on the Cox PH model, which provides a very nice introductory text. First, we fetch the data and build a simple Cox PH model for the time-to-arrest of released prisoners (`fin`: factor, received financial aid, with dummy coding `"no"` -> 0, `"yes"` -> 1; `age`: age at the time of release; `prio`: number of prior convictions).

Now we plug in the sample averages for our predictors into the $e^{\bf{X} \bf{b}}$ formula.

Now we plug in the predictor values of the first 4 persons into the $e^{\bf{X} \bf{b}}$ formula.

Now calculate the relative risk for the first 4 persons against the sample average and compare to the output from `predict.coxph()`.
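The computation being compared can be sketched numerically: the relative risk against the sample average is $e^{(\bf{X}-\bar{\bf{X}}) \bf{b}}$, since the $e^{\bar{\bf{X}} \bf{b}}$ denominator cancels the average out (the coefficients and covariate values below are invented, not the actual Rossi-data estimates):

```python
import math

# Hypothetical coefficient estimates for fin, age, prio
b = {'fin': -0.38, 'age': -0.06, 'prio': 0.10}

# Hypothetical covariate values for 4 persons (fin entered via its dummy code)
people = [
    {'fin': 0, 'age': 27, 'prio': 3},
    {'fin': 0, 'age': 18, 'prio': 8},
    {'fin': 0, 'age': 19, 'prio': 13},
    {'fin': 1, 'age': 23, 'prio': 1},
]

# Sample averages of the predictors
xbar = {k: sum(p[k] for p in people) / len(people) for k in b}

def relative_risk(x):
    # exp((X - Xbar) b): hazard relative to a person with average predictor values
    return math.exp(sum(b[k] * (x[k] - xbar[k]) for k in b))

risks = [relative_risk(p) for p in people]  # a person at the average gets exactly 1
```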
If you have a stratified model, the comparison in `predict.coxph()` is against the strata averages; this can be controlled via the `reference` option that is explained in the help page.