After fitting a Cox model it is possible to make predictions and retrieve the relative risk of new data. What I don't understand is how the relative risk is computed for an individual, and what it is relative to (e.g. the average of the population). Any recommendations for resources to help me understand this (I'm not very advanced in survival analysis, so the simpler the better)?
Solved – How to interpret the output of predict.coxph
cox-model, predictive-models, relative-risk
Related Solutions
Interpretation of the graph in your case
Note: The y-axis is not always the relative risk as in the example given in the vignette of the `dlnm` package. This is only the case in their example because they used mortality data and Poisson regression models. In their framework, the exponentiated regression coefficient from the Poisson models, $RR=\exp(\hat{\beta})$, is the relative risk. This is analogous to exponentiating the regression coefficients in logistic regression, which yields the odds ratio.
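As a quick numeric illustration of that exponentiation (using a made-up coefficient, not one from the vignette):

```python
import math

beta_hat = 0.095             # hypothetical Poisson regression coefficient
rr = math.exp(beta_hat)      # relative risk, approximately 1.0997
excess_pct = (rr - 1) * 100  # roughly a 10% higher event rate per unit of exposure
```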
Can I still use the model?
Yeah, you can still use such a model.
Let's summarize what you do:
- You use natural cubic B-splines as basis functions instead of polynomials to model the relationship between temperature and $\mathrm{CO}_{2}$ (`arglag` with option `type="ns"` instead of `type="poly"`).
- You assume that the effect of temperature is non-linear, as you specify `argvar` as splines. One important thing you have to know for the interpretation of the plots is that the function `crossbasis` automatically centers the values at the predictor mean (i.e. the mean temperature) if not specified otherwise. This is the reference value with which the predictions are later compared in the graphics.
- You consider lags up to 12 (option `lag=12` in `crossbasis`). (By the way: why do you suppress the warnings?)
- You calculate a GLM with Gaussian errors and the identity link function, which is equivalent to a simple linear regression (OLS). You could have used the `lm` function instead.
The plot that you have provided is interpreted as follows: The x-axis is the lag.
Interpretation of the values of the y-axis: The y-axis depicts the change in $\mathrm{CO}_{2}$ concentration associated with an increase of 10, 20, or 30°C compared to the mean temperature. If the predicted change is 0, an increase in temperature is not associated with a change in $\mathrm{CO}_{2}$ concentration compared to the concentration at the mean temperature: the predicted $\mathrm{CO}_{2}$ concentration is the same at $\bar{x}_{Temp}+z$ degrees (where $z$ is any amount, say 10 or 20 degrees) as at the mean temperature $\bar{x}_{Temp}$.
This means that for an increase in temperature of 10°C, the temperature at lag 0 (i.e. in the same hour) increases the $\mathrm{CO}_{2}$ concentration compared to the mean temperature. Because you specified `cumul=TRUE` in `crosspred`, the effects are cumulative. The cumulative effects of an increase of 10°C are practically nonexistent after 4 hours compared to the mean temperature. This suggests that the non-cumulative effects are negative at lags 1-4 and null from then on.
For temperature increases of 20 or 30°C, the cumulative effects on the $\mathrm{CO}_{2}$ concentration are lower in the first 1-4 hours compared to $\mathrm{CO}_{2}$ at the mean temperature. As with temperature increases of 10°C, the cumulative effects are practically nonexistent after 4 or 5 hours. Again: $\mathrm{CO}_{2}$ concentrations are the same at mean temperature and at an increase in temperature of 20 or 30°C after 4 or 5 hours.
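The cumulative curve described above is just the running sum of the lag-specific effects. A toy sketch with invented numbers (not your data) shows how negative effects at lags 1-4 cancel the initial positive effect:

```python
# Hypothetical lag-specific changes in CO2 for a +10°C step (vs. the mean temperature)
lag_effects = [0.8, -0.3, -0.3, -0.15, -0.05, 0.0, 0.0]

cumulative, running = [], 0.0
for effect in lag_effects:
    running += effect
    cumulative.append(round(running, 2))

# positive at lag 0, shrinking toward ~0 after lag 4
```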
I think a 3-D plot would be easier to interpret. Try the following code:
plot(cp, xlab="Temperature", col="red", zlab="CO2", shade=0.6,
     main="3D graph of temperature effect")
Interpretation of the example given in the vignette of the `dlnm` package
First, a little something about distributed lag models. They have the form:
$$ Y_{t}=\alpha + \sum_{l=0}^{K}\beta_{l}x_{t-l} + \text{other predictors} +\epsilon_{t} $$ where $K$ is the maximum lag and $x$ is a predictor. This is fitted using a multiple linear regression. So the coefficient $\beta_{1}$ estimates the effect of the previous day's value $x_{t-1}$ on $Y_{t}$. In essence, multiple lags of the predictor are included in the model simultaneously. This obviously has the problem that the lagged predictors are highly correlated (autocorrelation).
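In code, fitting such a model just means regressing $Y_t$ on the current and lagged values of $x$. A minimal sketch of building the lagged design matrix (with invented data) also makes the correlation problem visible:

```python
# Invented predictor series and maximum lag
x = [3.0, 2.5, 4.1, 3.8, 2.9, 3.3, 4.0, 3.6]
K = 2

# Row t holds (x_t, x_{t-1}, ..., x_{t-K}); regress Y_t on these columns.
design = [[x[t - l] for l in range(K + 1)] for t in range(K, len(x))]
# Adjacent columns are shifted copies of each other, hence highly correlated.
```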
A more advanced method is the polynomial distributed lag model. It has the same basic formula as above, but the impulse-response function is forced to lie on a polynomial of degree $q$ (link to a paper for Stata):
$$ \beta_{i} = a_{0} + a_{1}i + a_{2}i^2 +\ldots+a_{q}i^q $$ where $q$ is the degree of the polynomial and $i$ the lag length. Another formulation is $$ \beta_{i} = a_{0} + \sum_{j=1}^{q}a_{j}f_{j}(i) $$
Where $f_{j}(i)$ is a polynomial of degree $j$ in the lag length $i$. A good introduction to the `dlnm` package and polynomial distributed lag models can be found here.
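The constraint can be sketched directly: each lag coefficient $\beta_{i}$ is evaluated from a low-degree polynomial in the lag $i$, so only $q+1$ free parameters generate all $K+1$ lag coefficients (the $a_{j}$ values below are made up):

```python
# Hypothetical polynomial coefficients a_0..a_q with q = 2
a = [0.5, -0.20, 0.01]
K = 6  # maximum lag

# beta_i = a_0 + a_1*i + a_2*i^2 for each lag i
betas = [sum(a_j * i**j for j, a_j in enumerate(a)) for i in range(K + 1)]
```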
These models are often used in studies about air-pollution and health because air-pollution has lagged effects on health outcomes.
Let's look at this graph from the vignette of the `dlnm` package (page 13):
The degree of the polynomial was $q=4$ in this case, so the green line is a polynomial of 4th degree. The y-axis is the relative risk (RR) estimated via Poisson regression and the x-axis the considered lag. The relative risk has the following interpretation: persons who were exposed have a $(RR-1)\cdot100\%$ higher/lower chance of getting the outcome (e.g. death, lung cancer, etc.) compared to people who were not exposed. $RR>1$ means a positive association, $RR<1$ a protective association, and $RR=1$ no association. We see that for every increase of $\textrm{PM}_{10}$ by 10 units ($\mu \mathrm{g}/m^{3}$), there is a $(1.001-1)\cdot100\%=0.1\%$ increase in the risk of dying at lag 0 (i.e. on the same day as the exposure). Strangely, the exposure from about 9 days ago is protective: an increase of $10~\mu \mathrm{g}/m^{3}$ is associated with a decreased risk of dying compared to people with 10 units less exposure. We can also see that the exposure from 15 days before doesn't play a role (i.e. $RR\approx1$).
Let's look at the cumulative relative risk:
This is the same as before, but the effects are cumulated over time (i.e. summing all contributions from the lags up to the maximum lag). The red line starts at the same point as the green line in the first graphic (i.e. $\approx1.001$). We can see that people who have been exposed for five days have an increased cumulative risk of about $(1.005-1)\cdot100\%=0.5\%$ of dying compared to non-exposed people. Because the green line goes below the relative risk of $1$ after a lag of about 5 days, the cumulative association after 15 days is nearly $1$. This means that the protective effects of $\textrm{PM}_{10}$ from lag 5 on have compensated the harmful effects from earlier lags. Whether that is scientifically reasonable is another question.
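On the log scale the lag contributions add, so the cumulative RR is the product of the lag-specific RRs. A toy sketch with invented values (not the vignette's estimates) shows how harmful and protective lags cancel:

```python
import math

# Hypothetical lag-specific relative risks: harmful early, protective later
lag_rr = [1.0010, 1.0008, 1.0005, 1.0002, 1.0000, 0.9995, 0.9990, 0.9985]

# Cumulative RR over all lags = product of the lag-specific RRs
cumulative_rr = math.prod(lag_rr)
# the contributions roughly cancel, leaving a cumulative RR close to 1
```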
👋 Hi, author of lifelines here. What you're asking is possible. The Kaplan-Meier curve gives you a distribution of possible durations, where duration is the time between birth and death. However, given that a player has existed for $N$ months, you can condition the survival function on $T > N$ to get a better estimate.
Let $S(t) = P(T \ge t)$ be the survival function. We are curious about $S(t | T \ge N)$.
$$ S(t | T\ge N) = \frac{P(T \ge t \text{ and } T \ge N)}{P(T \ge N)} = \frac{P(T \ge t)}{S(N)} = \frac{S(t)}{S(N)},\;\; t \ge N $$
So we simply need to divide the survival function by itself evaluated at the duration seen thus far.
In your use case, you could do something like:
predictions = pd.DataFrame(index=kmf.survival_function_.index)
for ix, row in alive_individuals.iterrows():
    T = row['T']
    predictions[ix] = kmf.survival_function_ / kmf.survival_function_.loc[T]

# can't have probabilities greater than 1
predictions[predictions > 1.0] = 1.0
This gives you the new survival functions. However, lifelines has another utility you can use: `kmf.conditional_time_to_event_` computes these conditional survival functions and then takes the median time remaining. Output using some fake data I have:
KM_estimate - Conditional time remaining to event
timeline
0.0 56.0
6.0 50.0
7.0 49.0
9.0 47.0
13.0 43.0
15.0 41.0
17.0 39.0
19.0 37.0
22.0 34.0
26.0 32.0
29.0 29.0
32.0 26.0
33.0 27.0
36.0 24.0
38.0 22.0
41.0 19.0
43.0 17.0
45.0 15.0
47.0 13.0
48.0 13.0
51.0 10.0
53.0 8.0
54.0 7.0
56.0 7.0
58.0 5.0
60.0 8.0
61.0 7.0
62.0 6.0
63.0 6.0
66.0 3.0
68.0 1.0
69.0 6.0
75.0 0.0
So if a player lives until age 62, we expect 6 more months left (6 months being the median time to death, given the player lived to 62). That may help as well.
Best Answer
Edit: the following description applies to `survival` versions 3.2-8 and below. Starting with version 3.2-9, the default behavior of `predict.coxph()` changes with respect to treating 0/1 (dummy indicator) variables. See the package NEWS.

`predict.coxph()` computes the hazard ratio relative to the sample average for all $p$ predictor variables. Factors are converted to dummy predictors as usual, whose average can be calculated. Recall that the Cox PH model is a linear model for the log-hazard $\ln h(t)$:
$$ \ln h(t) = \ln h_{0}(t) + \beta_{1} X_{1} + \dots + \beta_{p} X_{p} = \ln h_{0}(t) + \bf{X} \bf{\beta} $$
Where $h_{0}(t)$ is the unspecified baseline hazard. Equivalently, the hazard $h(t)$ is modeled as $h(t) = h_{0}(t) \cdot e^{\beta_{1} X_{1} + \dots + \beta_{p} X_{p}} = h_{0}(t) \cdot e^{\bf{X} \bf{\beta}}$. The hazard ratio between two persons $i$ and $i'$ with predictor values $\bf{X}_{i}$ and $\bf{X}_{i'}$ is thus independent of the baseline hazard and independent of time $t$:
$$ \frac{h_{i}(t)}{h_{i'}(t)} = \frac{h_{0}(t) \cdot e^{\bf{X}_{i} \bf{\beta}}}{h_{0}(t) \cdot e^{\bf{X}_{i'} \bf{\beta}}} = \frac{e^{\bf{X}_{i} \bf{\beta}}}{e^{\bf{X}_{i'} \bf{\beta}}} $$
For the estimated hazard ratio between persons $i$ and $i'$, we just plug in the coefficient estimates $b_{1}, \ldots, b_{p}$ for the $\beta_{1}, \ldots, \beta_{p}$, giving $e^{\bf{X}_{i} \bf{b}}$ and $e^{\bf{X}_{i'} \bf{b}}$.
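A small numeric sketch of that ratio (coefficients and covariate values invented, not estimates from any fitted model):

```python
import math

b = [-0.38, -0.06, 0.10]  # hypothetical estimates b_1..b_p
x_i = [1, 27, 3]          # person i
x_j = [0, 22, 1]          # person i'

def lin_pred(x):
    """Linear predictor X b."""
    return sum(bk * xk for bk, xk in zip(b, x))

# h_i(t) / h_i'(t) = exp(X_i b) / exp(X_i' b) = exp((X_i - X_i') b); t cancels out
hazard_ratio = math.exp(lin_pred(x_i) - lin_pred(x_j))
```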
As an example in R, I use the data from John Fox's appendix on the Cox PH model, which provides a very nice introductory text. First, we fetch the data and build a simple Cox PH model for the time-to-arrest of released prisoners (`fin`: factor, received financial aid, with dummy coding `"no"` -> 0, `"yes"` -> 1; `age`: age at the time of release; `prio`: number of prior convictions).

Now we plug in the sample averages for our predictors into the $e^{\bf{X} \bf{b}}$ formula.

Now we plug in the predictor values of the first 4 persons into the $e^{\bf{X} \bf{b}}$ formula.

Now calculate the relative risk for the first 4 persons against the sample average and compare to the output from `predict.coxph()`.
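The computation being compared can be sketched numerically: the relative risk against the sample average is $e^{(\bf{X}-\bar{\bf{X}}) \bf{b}}$, since the $e^{\bar{\bf{X}} \bf{b}}$ denominator cancels the average out (the coefficients and covariate values below are invented, not the actual Rossi-data estimates):

```python
import math

# Hypothetical coefficient estimates for fin, age, prio
b = {'fin': -0.38, 'age': -0.06, 'prio': 0.10}

# Hypothetical covariate values for 4 persons (fin entered via its dummy code)
people = [
    {'fin': 0, 'age': 27, 'prio': 3},
    {'fin': 0, 'age': 18, 'prio': 8},
    {'fin': 0, 'age': 19, 'prio': 13},
    {'fin': 1, 'age': 23, 'prio': 1},
]

# Sample averages of the predictors
xbar = {k: sum(p[k] for p in people) / len(people) for k in b}

def relative_risk(x):
    # exp((X - Xbar) b): hazard relative to a person with average predictor values
    return math.exp(sum(b[k] * (x[k] - xbar[k]) for k in b))

risks = [relative_risk(p) for p in people]  # a person at the average gets exactly 1
```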
If you have a stratified model, the comparison in `predict.coxph()` is against the strata averages; this can be controlled via the `reference` option that is explained in the help page.