I have a set of 40 genes expression data as binary variables (overexpressed yes or not) for predict recurrence in a kind of cancer in a sample of 77 patients, where 22 has recurred. This set is called discovery set. The objective is to find a subset of these genes to predict the recurrence in cancer. Times to recurrence and follow-up times for censored observations were saved.
In a first step, I have fitted univariate Cox regression model, and I have discarded 13 genes of the original set of 40 genes, removing those that have P-values > 0.25 in the LRT. Then, I work only with 27 genes.
In the second step, I am analyzing all the possible multivariate models of 3 variables, 4 variables, 5 variables, etc. I decide what is the more predictive model of $X$ variables fitting multivariate Cox regression and calculating the average c-index of 100 times 5-folder CV. For calculate the c-index I use the “index.corrected” reported by the validate
function of the Harrell’s rms
package, and I calculate the average of the 100 times I run the CV.
I will use the AIC for compare the best model with $X+1$ variables and the best model with $X$ variable, to decide what it is the best predictive model.
Finally, I have an independent validation set of patients ($n=105$) where I will check the model. I will use the linear predictor of the best model in the discovery set as prognostic index, dividing in tertiles of risk, and calculating the values of the prognostic index of these new patients. I will analyze statistical differences of the 3 groups with the log-rank test, HR with 95CI% and KM Plots.
Best Answer
There are a number of serious problems with your approach.
rms
package has a few methods for estimating continuous calibration curves. See for example the functioncalibrate
.