Solved – Survival analysis for event prediction

classification, survival

For each record in my datasets I have the following information

$$ (X_1 \ , \dots \ , X_m \ , \delta \ , T \ )$$

where $X_i$ are features, $\delta$ is 1 if the target event occurs and 0 otherwise, and $T$ is the timestamp of the event. In particular, $T$ might be missing if there was no event, or set to the time the follow-up ended.

I want to compute a risk index for each record in my dataset.

I was thinking of using a classification model that takes the features $X_i$ and predicts the class $\delta$. However, $T$ is important: if the event is likely to occur soon, the risk should be higher.

That is why survival analysis seems suited to this problem. I don't need a full estimate of the survival function $S(t) = P(T>t)$, just a single index that represents the risk for each record.

The mean survival time, which can be computed for each record, seems a nice risk index: the lower it is, the higher the risk.

My questions are:

  1. Is the survival analysis suited for my purposes?
  2. How can I evaluate the performance of my model?

About question (2): I am keen to use Harrell's $c$-index, for example, but I am not sure which predicted outcome is used to compute it. From Harrell's book Regression Modeling Strategies, page 247:

The $c$ index […] is computed by taking all possible pairs of subjects such that one subject responded and the other did not. The index is the proportion of such pairs with the responder having a higher predicted probability of response than the non responder.

If survival analysis turns out to be the right choice, I think it should be easy to use some standard method to introduce time-varying covariates $X_i(t)$.

Best Answer

Is the survival analysis suited for my purposes?

The only thing that makes survival analysis seem less applicable here is:

... $T$ might be missing if there was no event or set to time the follow-up ended.

For most models you will need to know the last time each individual was observed to be alive (event-free), i.e. the censoring time, so a missing $T$ will not do for records without an event. Otherwise, survival analysis is straightforward to apply here, e.g. a Cox proportional hazards model with survival::coxph in R, or a parametric model with survival::survreg.
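To make the point about the last observed time concrete, here is a minimal sketch in plain Python of how each record could be turned into the (time, event) pair that survival models expect. The names `Observation`, `to_observation` and `last_followup` are hypothetical and only serve the illustration; they are not part of any library.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    time: float  # event time if event == 1, otherwise the censoring time
    event: int   # 1 if the event was observed, 0 if the record is censored


def to_observation(event_time: Optional[float], delta: int,
                   last_followup: float) -> Observation:
    """Build a right-censored (time, event) pair from one raw record.

    Assumption: last_followup is the last time the record was known
    to be event-free (hypothetical field, not in the original data layout).
    """
    if delta == 1:
        if event_time is None:
            raise ValueError("an observed event needs an event time")
        return Observation(float(event_time), 1)
    # No event: a missing T is replaced by the end of follow-up.
    return Observation(float(last_followup), 0)
```

A record with no event and follow-up ending at time 12 becomes `Observation(time=12.0, event=0)` rather than a record with a missing time.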

The mean survival time, which can be computed for each record, seems a nice risk index: the lower it is, the higher the risk.

Yes, you can use the mean survival time, or just the linear predictor, for either of the two (classes of) models mentioned above.
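As a sketch of why the linear predictor is enough for ranking, assume a Cox model with baseline hazard $h_0$, baseline survival function $S_0$ and coefficient vector $\beta$ (notation introduced here, not in the question):

$$ h(t \mid x) = h_0(t) \, \exp(\eta), \qquad \eta = x^\top \beta, \qquad S(t \mid x) = S_0(t)^{\exp(\eta)} $$

$$ \mu(x) = \int_0^\infty S(t \mid x) \, dt $$

Since $0 \le S_0(t) \le 1$, a larger $\eta$ gives a smaller $S(t \mid x)$ at every $t$ and hence a smaller mean survival time $\mu(x)$, so ranking records by $\eta$ is the reverse of ranking them by $\mu(x)$. Two caveats: if the estimated survival curve does not drop to zero within the follow-up, the integral has to be truncated at some horizon (a restricted mean); and for the accelerated failure time models fitted by survival::survreg the linear predictor is on the log-time scale, so there a larger value means longer survival, i.e. lower risk.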

How can I evaluate the performance of my model?

The $c$ index seems like a sensible choice to me, as a "natural" generalization of the AUC. Note that it is implemented in R, e.g. in Hmisc::rcorr.cens.
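To show which predicted outcome enters the index, here is a minimal sketch of Harrell's $c$ for right-censored data in plain Python. It is an illustration of the idea, not the Hmisc implementation: the function name `harrell_c` is hypothetical, ties in the observed times are simply skipped, and the score is assumed to be oriented so that a higher value means higher risk (e.g. a Cox linear predictor; negate a predicted mean survival time before passing it in).

```python
from typing import Sequence


def harrell_c(times: Sequence[float], events: Sequence[int],
              risk: Sequence[float]) -> float:
    """Concordance index for right-censored data (simplified sketch).

    A pair (i, j) is usable when subject i had an observed event and
    subject j was still under observation after that time. The pair is
    concordant when subject i also has the higher risk score.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
print(harrell_c(times, events, [4.0, 3.0, 2.0, 1.0]))  # perfectly ordered scores
```

Only the ordering of the scores matters, which is why any monotone risk index (linear predictor, negated mean survival time) can be evaluated this way. A value of 0.5 corresponds to random ordering and 1 to perfect ordering.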