Predictive Models – Best Predictive Cox Regression Model Using C-Index and Cross-Validation

cox-model, cross-validation, prediction, predictive-models, survival

I have expression data for 40 genes, coded as binary variables (overexpressed: yes or no), to predict recurrence of a type of cancer in a sample of 77 patients, of whom 22 have recurred. This set is called the discovery set. The objective is to find a subset of these genes that predicts recurrence. Times to recurrence, and follow-up times for censored observations, were recorded.

As a first step, I fitted univariate Cox regression models and discarded 13 of the original 40 genes, removing those with likelihood-ratio test (LRT) p-values > 0.25. I then work only with the remaining 27 genes.
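Roughly, the screening step looks like the sketch below; the data frame `disc` and the column names `gene1`–`gene40`, `time`, and `recur` are placeholders for my actual variables.

```r
## Univariable Cox screening: keep genes with LRT p-value <= 0.25
## (disc, gene1..gene40, time, recur are hypothetical names)
library(survival)

genes <- paste0("gene", 1:40)
lrt_p <- sapply(genes, function(g) {
  f   <- as.formula(paste("Surv(time, recur) ~", g))
  fit <- coxph(f, data = disc)
  ## likelihood-ratio test statistic and p-value for the single gene
  1 - pchisq(2 * diff(fit$loglik), df = 1)
})
keep <- genes[lrt_p <= 0.25]   # 27 genes retained in my case
```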

In the second step, I am analyzing all possible multivariable models with 3 variables, 4 variables, 5 variables, and so on. I decide which model with $X$ variables is the most predictive by fitting a multivariable Cox regression and computing the average c-index over 100 repetitions of 5-fold cross-validation (CV). To calculate the c-index I use the "index.corrected" value reported by the validate function of Harrell's rms package, averaged over the 100 CV repetitions.
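For a single candidate subset, this step might look like the sketch below. Here `vars` is a hypothetical character vector of gene names; for Cox models validate() reports the corrected Somers' Dxy, which I convert to a c-index as Dxy/2 + 0.5.

```r
## Repeated 5-fold CV c-index for one candidate gene subset
## (disc, vars are hypothetical; fit must have x = TRUE, y = TRUE)
library(rms)

f   <- as.formula(paste("Surv(time, recur) ~", paste(vars, collapse = " + ")))
fit <- cph(f, data = disc, x = TRUE, y = TRUE)

cvals <- replicate(100, {
  v <- validate(fit, method = "crossvalidation", B = 5)  # B = number of folds
  v["Dxy", "index.corrected"] / 2 + 0.5                  # corrected c-index
})
mean(cvals)
```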

I will use the AIC to compare the best model with $X+1$ variables against the best model with $X$ variables, to decide which is the best predictive model.
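The comparison itself is simple; `fit_x` and `fit_x1` are hypothetical names for the best $X$- and $(X+1)$-variable fits.

```r
## Compare the best X-variable and best (X+1)-variable Cox fits;
## the smaller AIC is preferred (fit_x, fit_x1 are hypothetical objects)
AIC(fit_x, fit_x1)
```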

Finally, I have an independent validation set of patients ($n=105$) in which I will check the model. I will use the linear predictor of the best model from the discovery set as a prognostic index, divide it into tertiles of risk, and compute the prognostic index for these new patients. I will test for differences among the 3 groups with the log-rank test, hazard ratios with 95% CIs, and Kaplan–Meier plots.
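A rough sketch of that validation step, assuming `valid` is the validation data frame (hypothetical name) and `fit` is the final discovery-set model; here the tertile cut points are taken from the validation-set linear predictor, though they could equally be fixed from the discovery set.

```r
## External validation: prognostic index, tertile groups, log-rank, HRs, KM
library(survival)

lp  <- predict(fit, newdata = valid, type = "lp")        # prognostic index
grp <- cut(lp, quantile(lp, c(0, 1/3, 2/3, 1)),
           include.lowest = TRUE, labels = c("low", "mid", "high"))

survdiff(Surv(time, recur) ~ grp, data = valid)          # log-rank test
summary(coxph(Surv(time, recur) ~ grp, data = valid))    # HRs with 95% CIs
plot(survfit(Surv(time, recur) ~ grp, data = valid))     # Kaplan-Meier curves
```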

Best Answer

There are a number of serious problems with your approach.

  1. You used univariable screening, which is known to have a number of problems.
  2. You took the variables that passed the screen and fitted a model, then used cross-validation (CV) to validate the model without informing CV of the univariable screening step. This creates a major bias. CV needs to repeat all modeling steps that made use of $Y$.
  3. You are using a non-optimal optimality criterion (the $c$-index; generalized ROC area), whereas the most sensitive criterion is likelihood-based.
  4. Your total sample size is too low by perhaps a factor of 100 for data splitting to be a reliable model validation approach. You will find that if you were to split the training and test samples again you will get completely different genetic signatures and validation performance.
  5. Your original discovery set with 22 events is only large enough to assess a single pre-specified gene.
  6. If the raw data were continuous gene expression values, it is completely invalid to dichotomize at "overexpressed".
  7. Validating a model by binning predictions (in your case, tertiles of predicted risk) is arbitrary and has very low resolution. The R rms package has several methods for estimating continuous calibration curves; see for example the function calibrate (a minimal sketch follows below).
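A minimal sketch of a continuous, overfitting-corrected calibration curve with rms::calibrate, under assumptions: `disc` is the discovery data, `gene1`–`gene3` are hypothetical predictors, time is measured in months, and u = 36 is a hypothetical evaluation time within follow-up.

```r
## Bootstrap-corrected continuous calibration curve for a Cox model
## (disc, gene1..gene3, u = 36 months are hypothetical choices)
library(rms)

fit <- cph(Surv(time, recur) ~ gene1 + gene2 + gene3, data = disc,
           x = TRUE, y = TRUE, surv = TRUE, time.inc = 36)
cal <- calibrate(fit, u = 36, B = 200)   # default cmethod uses adaptive hazard regression
plot(cal)
```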