Solved – Cox PH model selection and validation

cox-model, model selection, predictive-models

I am trying to analyze my data with a Cox proportional hazards (PH) survival model in SPSS v.19, and I am also trying to build prediction models with and without a biomarker of interest. I am a clinician (not a biostatistician), and this is for a paper that I need to complete my Ph.D.
For model selection I used the following method. First, I identified the univariate predictors of the outcome, and then I entered all of them into a backward stepwise Cox analysis to obtain my final model (I have only 25 events). It has now been suggested that I start from some a priori variables (for example age or sex, which were not associated with the outcome in my univariate analysis and so were not included in the final model) and retain further variables only if they improve model fit, as assessed by the log-likelihood ratio (LLR).

My first questions are:

  1. Is it statistically correct to use an a priori defined model and then add my statistically significant variables, comparing the LLR of the different models (using the hierarchical Cox analysis)?
  2. If this is correct, how do I initially select a model to which I then add another variable?
  3. How do I solve the problem of overfitting? Do I have to use bootstrapping for my final model?

I will also try to compare the final model with a model that includes my biomarker.

  1. How do I save the predicted probabilities for each model so that I can compare them (I need them for the C statistic and NRI analysis)? SPSS offers this option for simple logistic regression. Is there a macro that could do the same for a Cox model?
  2. Is it correct to use the predicted probabilities from the simple logistic regression instead?

Best Answer

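Your first method, bivariate screening followed by backward elimination, is not recommended. Stepwise methods like these tend to give biased coefficients, p-values that are too small, and confidence intervals that are too narrow. Their problems have been discussed extensively here and elsewhere (search on "Stepwise" to find some).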

Taking your questions one at a time:

Is it statistically correct to use an a priori defined model and then add my statistically significant variables, comparing the LLR of the different models (using the hierarchical Cox analysis)?

This is not a bad method. It is certainly an improvement over stepwise selection.
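To make the comparison of nested models concrete, here is a minimal sketch in Python of the likelihood ratio test, assuming you have already fitted both models elsewhere and copied out their log-likelihoods. The function names `chi2_sf` and `likelihood_ratio_test` and the example numbers are made up for illustration. If your software reports -2 log-likelihood instead, the test statistic is simply the difference between the two reported values.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum over half-integer powers.
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def likelihood_ratio_test(loglik_reduced, loglik_full, extra_params):
    """Compare two nested models fitted to the same data.

    loglik_reduced: log-likelihood of the smaller (a priori) model
    loglik_full:    log-likelihood of the model with the added variable(s)
    extra_params:   number of coefficients added to the smaller model
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, extra_params)


# Hypothetical log-likelihoods, not taken from any real data set.
stat, p_value = likelihood_ratio_test(-100.0, -97.5, extra_params=1)
print(f"LR statistic = {stat:.2f}, p = {p_value:.4f}")
```

The test is only valid when the smaller model is nested in the larger one and both are fitted to exactly the same cases, so rows with missing values for the added variable need to be excluded from both fits.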

If this is correct, how do I initially select a model to which I then add another variable?

Based on theory, substantive knowledge, research questions etc.

How do I solve the problem of overfitting? Do I have to use bootstrapping for my final model?

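If you have a lot of events relative to the number of candidate predictors, then overfitting shouldn't be too much of an issue (in a Cox model it is the number of events, not the number of cases, that limits how many predictors the data can support). With only 25 events it is a real concern, so keep the model small. You can also do things like use a training and test set if you are concerned.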
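As a rough illustration of the training and test set idea, the sketch below splits case indices so that events and censored cases are divided in the same proportion. The function name `stratified_split`, the 20% test fraction, and the 25 events among 100 cases are assumptions for the example only. Note that with so few events a held-out set contains very few of them, so resampling approaches such as the bootstrap mentioned in the question may use the data more efficiently.

```python
import random


def stratified_split(events, test_fraction=0.2, seed=0):
    """Split case indices into train and test, separately within
    events (1) and censored cases (0), so both sets keep a similar
    event proportion."""
    rng = random.Random(seed)
    train, test = [], []
    for status in (0, 1):
        idx = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


# Hypothetical data: 25 events among 100 cases.
events = [1] * 25 + [0] * 75
train_idx, test_idx = stratified_split(events)
n_test_events = sum(events[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_events)
```

The model is then fitted on the training indices only, and its performance is judged on the test indices.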

Your latter questions seem to be more SPSS specific; I don't use SPSS.
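Independently of the software, the C statistic mentioned in the question can be computed from any per-case risk score once it has been exported. Below is a minimal sketch of Harrell's concordance index in Python, assuming a higher score means higher risk (for example the linear predictor of each Cox model). The function name `harrell_c` and the toy data are made up, and pairs with tied observed times are simply skipped.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored data.

    A pair is comparable when the case with the shorter observed
    time had an event. It is concordant when that case also has
    the higher risk score; tied scores count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored case cannot be the earlier failure
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: follow-up times, event indicators (1 = event), risk scores.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
risk_without_marker = [4, 3, 2, 1]
print(harrell_c(times, events, risk_without_marker))
```

Computing this for the model without and with the biomarker, on the same cases, gives the two C statistics to compare.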