Logistic Models in R – Understanding Predicted Values for clogit() Model via predict.coxph() and Calculating Survival Probability

cox-modellogisticrsurvival

Main questions
I'm using conditional logistic regression for a resource selection function (RSF) in an ecology study (more details below). I have no trouble implementing the code but I'm having difficulty grasping what the fitted values represent and how I can ultimately calculate probability of an event for given conditions. I understand that clogit() is actually a Cox Proportional Hazard model with constant time, and I have read up on these models, the help files, and a number of related posts, but so far I have not been able to find answers detailed enough for me to fully understand:

What does type "survival" return when using predict.coxph? It is not discussed in the help file. My assumption is that it is the probability of an event not occurring for a given row in the hypothetical next time step (as this is a constant time model), but I don't know if that is accurate?
How can I calculate predicted probability of event occurrence? I have seen it suggested that 1-survival should give me this value (which makes sense to me if I'm correct above), but I've also seen it stated that risk/1 + risk would provide this as well, which I do not understand.
More generally, when predicting from a stratified Cox, my understanding is that all predictions are relative? For example, with type "risk", I'd be getting the average change hazard ratio within each strata for given predictor values. Is that understanding correct?

I would greatly appreciate any help. Thank you!

Additional details if needed
My goal is to compare nest locations (cases) to random points (controls) for bird species with several continuous predictors and ultimately predict the probability of nest occurrence based on predictors. I use "nest_id" as a strata with the clogit() function from the survival package. The reason for this is that not all random points were available to all birds because of territoriality etc., and so traditional logistic regression would not be appropriate. Each strata consists of a known nest location and 5 random points within the territory that could have theoretically been selected by that nesting pair. Thus, in my case, the "event" would be nest occurrence, not mortality. Here is my code to illustrate:

# My data (rwbl.rsf)
# A tibble: 6 × 5
  nest_id    status ndvi_pix veg_pix wood_pix
  <chr>       <dbl>    <dbl>   <dbl>    <dbl>
1 aml0032022      1  0.235     1.39      231.
2 aml0032022      0  0.00778   1.06      232.
3 aml0032022      0  0.174     1.01      201.
4 aml0032022      0  0.306     1.42      207.
5 aml0032022      0 -0.0670    1.22      230.
6 aml0032022      0 -0.228     0.963     228.

# My model
model = clogit(status ~ scale(ndvi_pix) + scale(veg_pix) + scale(wood_pix) + strata(nest_id), 
               rwbl.rsf, method = "exact")

# Predictions
preds = predict(model, se.fit = T, type = "survival")
preds$fit[1:5]
        1         2         3         4         5 
0.7315548 0.9185614 0.8761767 0.6907805 0.9302214

Best Answer

When you ask for expected or survival predictions from a Cox model and don't specify newdata, then you get results directly from the original data and model. You can see what's going on by typing survival:::predict.coxph at the R command prompt. Here's the key code:

if (type == "expected") {
        if (missing(newdata)) 
            pred <- y[, ncol(y)] - object$residuals
...
}

Here y is the Surv object of the original data, whose last column is the set of event markers, and object$residuals are the corresponding martingale residuals, the differences between the observed number of events (here, 1 or 0) and the model-predicted values. Therneau and Grambsch have an accessible explanation of how martingales and martingale residuals play roles in the counting-process approach to Cox models.

Each pred value in the above code is thus the model-predicted number of events for that case at the corresponding observation time. (The code for clogit sets time=1 for all observations.) If you specify type="expected", that's what predict.coxph() returns directly.

As the predict.coxph help page says, "The survival probability for a subject is equal to exp(-expected)." That's a straightforward result for a continuous-time survival model, as the expected number of events at time $t$ is the cumulative hazard $H(t)$, in turn related to the survival probability by $S(t)=\exp(-H(t))$.

For your first question, you aren't quite right. What you get is the predicted survival probability if you repeated the experiment again (effectively to time=1). Cox models can make no predictions beyond the last observation time, and the way that clogit is coded all observation times are set to a value of 1.

For your second question, the event probability when there can be at most 1 event is 1 minus the survival probability. I'd recommend using that.

For your third question (related to part of your second question about using "risk"), you are correct that Cox models of "risk" are always relative to something. The choice of reference is thus critical. The Details section of the help page for predict.coxph explains the default choices, why they were made, and other options.

In your application, where stratification is a programming trick to fit a conditional logit model via a Cox model routine and you presumably care more about the regression-coefficient estimates than using them to make within-stratum predictions, I'd be reluctant to use the "risk" type of predictions at all. If you do want such predictions, read the help page very carefully first.

Related Solutions

Solved – Comparison of CPH, accelerated failure time model or neural networks for survival analysis

It depends on why you are making models. Two main reasons to construct survival models are (1) to make predictions or (2) to model effect sizes of covariates.

If you want to use them in a predictive setting in which you want to obtain an expected survival time given a set of covariates, neural networks are likely the best choice because they are universal approximators and make less assumptions than the usual (semi-)parametric models. Another option which is less popular but not less powerful is support vector machines.

If you are modelling to quantify effect sizes, neural networks won't be of much use. Both Cox proportional hazards and accelerated failure time models can be used for this goal. Cox PH models are by far the most widely used in clinical settings, in which the hazard ratio gives a measure of effect size for each covariate/interaction. In engineering settings, however, accelerated failure time (AFT) models are the weapon of choice.

Survival Probability Calculation – From Cox PH at Time t

The Cox proportional hazards model can be described as follows:

$$h(t|X)=h_{0}(t)e^{\beta X}$$ where $h(t)$ is the hazard rate at time $t$, $h_{0}(t)$ is the baseline hazard rate at time $t$, $\beta$ is a vector of coefficients and $X$ is a vector of covariates.

As you will know, the Cox model is a semi-parametric model in that it is only partially defined parametrically. Essentially, the covariate part assumes a functional form whereas the baseline part has no parametric functional form (it's form is that of a step function).

Additionally, the survival curve of the Cox model is:

$$\begin{align} S(t|X)&=\text{exp}\bigg(-\int_{0}^{t}h_{0}(t)e^{\beta X}\,dt\bigg)\\ &=\text{exp}\big(-H_{0}(t)\big)^{\text{exp}(\beta X)}\\ &=S_{0}(t)^{\text{exp}(\beta X)} \end{align}$$

where $H_{0}(t)=\int_{0}^{t}h_{0}(t)\,dt$, $S_{0}(t)=\text{exp}\big(-H_{0}(t)\big)$, $S(t)$ is the survival function at time $t$, $S_{0}(t)$ is the baseline survival function at time $t$ and $H_{0}(t)$ is the baseline cumulative hazard function at time $t$.

The R function basehaz() provides the estimated cumulative hazard function, $H_{0}(t)$, defined above. For example, I can fit a Cox PH model with a single covariate sexn in R as follows:

f=formula(sv~factor(sexn))
cox.fit=coxph(f)

I can then extract (and plot) the underlying baseline cumulative hazard function as follows:

bh=basehaz(cox.fit)
plot(bh[,2],bh[,1],main="Cumulative hazard function",xlab="Time",ylab="H0(t)")

Now, because of the proportional nature of the Cox model, to obtain the survival curves of the two groups defined by their sexn value I can just raise the cumulative hazard function to the power of the estimated coefficient for sexn.

For example, for my variable $sexn=\{0,1\}$, the two survival curves would be:

$$S(t|X=0)=\text{exp}(-H_{0}(t))^{\text{exp}(\beta(0))}=\text{exp}(-H_{0}(t))$$

and

$$S(t|X=1)=\text{exp}(-H_{0}(t))^{\text{exp}(\beta(1))}=\text{exp}(-H_{0}(t))^{\text{exp}(\beta)}$$

If you want to see the relative survival, you can just plot the curves as follows:

plot(bh[,2],exp(-bh[,1])^(exp(cox.fit$coef)),xlim=c(40,85),ylim=c(0,1),
col="red",main="Survival curves for two groups",xlab="Time",ylab="S(t|X)")
par(new=TRUE)
plot(bh[,2],exp(-bh[,1]),xlim=c(40,85),ylim=c(0,1),
col="blue",main="Survival curves for two groups",xlab="Time",ylab="S(t|X)")
legend("topright",c("sexn=1","sexn=0"),lty=c(1,1),col=c(2,4))

Thus, you can see that the group with $sexn=1$ has a lower survival than the group with $sexn=0$. If you want to measure the relative survival of the two groups you can do so in many ways. You can say that for two individuals (differing in only $sexn$) that start at $\text{Time}=40$, the individual with $sexn=1$ has a lower probability of surviving to any time $t>40$ compared with the individual with $sexn=0$.

I believe what you are trying to achieve is to calculate the survival estimate:

$$S(t=30|X)$$

This can be achieved by fitting a Cox model to a given survival object and applying the estimated coefficients to each individual depending on their individual covariates. This will scale the baseline survival curve and give you the desired survival estimate for each of your individuals.

Best Answer

Related Solutions

Solved – Comparison of CPH, accelerated failure time model or neural networks for survival analysis

Survival Probability Calculation – From Cox PH at Time t

Related Question