Logistic Models in R – Understanding Predicted Values for clogit() Model via predict.coxph() and Calculating Survival Probability

cox-modellogisticrsurvival

Main questions
I'm using conditional logistic regression for a resource selection function (RSF) in an ecology study (more details below). I have no trouble implementing the code but I'm having difficulty grasping what the fitted values represent and how I can ultimately calculate probability of an event for given conditions. I understand that clogit() is actually a Cox Proportional Hazard model with constant time, and I have read up on these models, the help files, and a number of related posts, but so far I have not been able to find answers detailed enough for me to fully understand:

  1. What does type "survival" return when using predict.coxph? It is not discussed in the help file. My assumption is that it is the probability of an event not occurring for a given row in the hypothetical next time step (as this is a constant time model), but I don't know if that is accurate?

  2. How can I calculate predicted probability of event occurrence? I have seen it suggested that 1-survival should give me this value (which makes sense to me if I'm correct above), but I've also seen it stated that risk/1 + risk would provide this as well, which I do not understand.

  3. More generally, when predicting from a stratified Cox, my understanding is that all predictions are relative? For example, with type "risk", I'd be getting the average change hazard ratio within each strata for given predictor values. Is that understanding correct?

I would greatly appreciate any help. Thank you!

Additional details if needed
My goal is to compare nest locations (cases) to random points (controls) for bird species with several continuous predictors and ultimately predict the probability of nest occurrence based on predictors. I use "nest_id" as a strata with the clogit() function from the survival package. The reason for this is that not all random points were available to all birds because of territoriality etc., and so traditional logistic regression would not be appropriate. Each strata consists of a known nest location and 5 random points within the territory that could have theoretically been selected by that nesting pair. Thus, in my case, the "event" would be nest occurrence, not mortality. Here is my code to illustrate:

# My data (rwbl.rsf)
# A tibble: 6 × 5
  nest_id    status ndvi_pix veg_pix wood_pix
  <chr>       <dbl>    <dbl>   <dbl>    <dbl>
1 aml0032022      1  0.235     1.39      231.
2 aml0032022      0  0.00778   1.06      232.
3 aml0032022      0  0.174     1.01      201.
4 aml0032022      0  0.306     1.42      207.
5 aml0032022      0 -0.0670    1.22      230.
6 aml0032022      0 -0.228     0.963     228.

# My model
model = clogit(status ~ scale(ndvi_pix) + scale(veg_pix) + scale(wood_pix) + strata(nest_id), 
               rwbl.rsf, method = "exact")

# Predictions
preds = predict(model, se.fit = T, type = "survival")
preds$fit[1:5]
        1         2         3         4         5 
0.7315548 0.9185614 0.8761767 0.6907805 0.9302214 

Best Answer

When you ask for expected or survival predictions from a Cox model and don't specify newdata, then you get results directly from the original data and model. You can see what's going on by typing survival:::predict.coxph at the R command prompt. Here's the key code:

if (type == "expected") {
        if (missing(newdata)) 
            pred <- y[, ncol(y)] - object$residuals
...
}

Here y is the Surv object of the original data, whose last column is the set of event markers, and object$residuals are the corresponding martingale residuals, the differences between the observed number of events (here, 1 or 0) and the model-predicted values. Therneau and Grambsch have an accessible explanation of how martingales and martingale residuals play roles in the counting-process approach to Cox models.

Each pred value in the above code is thus the model-predicted number of events for that case at the corresponding observation time. (The code for clogit sets time=1 for all observations.) If you specify type="expected", that's what predict.coxph() returns directly.

As the predict.coxph help page says, "The survival probability for a subject is equal to exp(-expected)." That's a straightforward result for a continuous-time survival model, as the expected number of events at time $t$ is the cumulative hazard $H(t)$, in turn related to the survival probability by $S(t)=\exp(-H(t))$.

For your first question, you aren't quite right. What you get is the predicted survival probability if you repeated the experiment again (effectively to time=1). Cox models can make no predictions beyond the last observation time, and the way that clogit is coded all observation times are set to a value of 1.

For your second question, the event probability when there can be at most 1 event is 1 minus the survival probability. I'd recommend using that.

For your third question (related to part of your second question about using "risk"), you are correct that Cox models of "risk" are always relative to something. The choice of reference is thus critical. The Details section of the help page for predict.coxph explains the default choices, why they were made, and other options.

In your application, where stratification is a programming trick to fit a conditional logit model via a Cox model routine and you presumably care more about the regression-coefficient estimates than using them to make within-stratum predictions, I'd be reluctant to use the "risk" type of predictions at all. If you do want such predictions, read the help page very carefully first.

Related Question