I have 4,000 records of survival data and 100 potential predictor variables. My aim is to identify the most important variables describing these data.
Since complete data records are needed, I excluded
- all variables with more than 20% missing values,
- all variables with high VIF values (>10), to avoid collinearity,
- all variables that did not converge in a single-variable Cox analysis.
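As an illustration of the VIF screen described above, here is a minimal base-R sketch; the data frame `d` and its columns are made up for the example, and the VIFs are computed as the diagonal of the inverse correlation matrix (equivalent to the usual regression-based definition for standardized predictors):

```r
## Sketch of a VIF screen on hypothetical numeric predictors
set.seed(1)
d <- data.frame(a = rnorm(100))
d$b <- d$a + rnorm(100, sd = 0.1)   # nearly collinear with a
d$c <- rnorm(100)

## VIFs are the diagonal of the inverse correlation matrix
vif  <- setNames(diag(solve(cor(d))), colnames(d))
keep <- names(vif)[vif <= 10]       # drop predictors with VIF > 10
vif
keep
```

Here `a` and `b` get very large VIFs and would both be flagged, while `c` survives the screen.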
Variables that did not satisfy the proportional hazards assumption I tried to transform into a more normally distributed form, since I was told that this makes it more likely the proportional hazards criterion is fulfilled. As you can see in the table further down, the model includes a variable log(F), which is such a transformed variable. After the transformation, my checking function cox.zph no longer "complained" about this variable.
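The check-then-transform workflow described above can be sketched with the survival package's built-in lung data; `ph.karno` merely stands in for a predictor like F, and the transformation is shown as a mechanical step, not a claim that logging generally repairs a proportional hazards violation:

```r
## Sketch of the PH check / transform / re-check loop
library(survival)

fit <- coxph(Surv(time, status) ~ ph.karno, data = lung)
cox.zph(fit)          # tests the proportional hazards assumption

## If cox.zph "complains" (small p-value), refit with a transformed
## version of the predictor and check again:
fit2 <- coxph(Surv(time, status) ~ log(ph.karno), data = lung)
cox.zph(fit2)
```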
To reduce the number of variables from the remaining 70 to the most important ones, I applied stepwise BIC selection to my Cox regression using coxph.
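For concreteness, BIC-based stepwise selection on a coxph fit can be done with `step()` by setting the penalty `k = log(n)`; this is a sketch of the procedure described above (not an endorsement of it, per the answer below), again using survival's lung data, and taking the number of events as the effective sample size:

```r
## Sketch of stepwise BIC selection for a Cox model
library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.karno", "wt.loss")])
full <- coxph(Surv(time, status) ~ age + sex + ph.karno + wt.loss, data = d)

## step() with k = log(n) penalizes by BIC rather than the default AIC
n   <- sum(d$status == 2)   # in lung, status == 2 codes a death
sel <- step(full, direction = "both", k = log(n), trace = 0)
summary(sel)
```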
As can be seen below, the variables E and log(F) are highly significant, but the variable E in particular has a very large confidence interval, which makes me wonder how such a variable can still be significant in the model.
I was made aware that the stepwise selection I performed might have had a negative effect on the confidence intervals, although I am not sure what the relation is here, since I could have selected these 6 variables from the very beginning.
I am trying my best to understand the mathematical foundations of proper model construction, but now I am stuck.
What would be a proper strategy to narrow the 70 variables down to the most important ones for the final Cox model?
I read that variable selection is always problematic, but I cannot fit a full model (with all variables) because of the missing values and my fear of overfitting.
I hope I was able to formulate my problem in an understandable way and am thankful for any advice.
Thanks in advance!
Mark
Best Answer
My book Regression Modeling Strategies, 2nd Edition has detailed strategies and case studies for model building and validation. Detailed course notes going along with the book may be found at http://biostat.mc.vanderbilt.edu/RmS#Materials
Your original strategy creates several difficulties, including the following:
Stepwise variable selection, whether based on $R^2$, AIC, BIC, $P$-values, or $C_p$, is not a valid solution to your problem unless you incorporate shrinkage while doing variable selection (and even that would only solve $\frac{1}{4}$ of the problem anyway).
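One way to build shrinkage directly into a Cox fit is the `ridge()` penalty that ships with the survival package; this is only a sketch with an arbitrarily chosen penalty `theta = 1` on the lung data (a lasso-penalized Cox model, e.g. via glmnet's Cox family, is another common choice):

```r
## Sketch of a shrinkage (ridge-penalized) Cox fit
library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.karno")])
fit <- coxph(Surv(time, status) ~ ridge(age, sex, ph.karno, theta = 1),
             data = d)
coef(fit)   # coefficients shrunk toward zero by the penalty
```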
As you reformulate the approach without variable selection or deletion of sometimes-missing predictors, consider making data reduction and multiple imputation key components of your strategy. Data reduction lowers the dimensionality of the predictor space in a way that is masked to the dependent variable; therefore it does not create biases or multiplicity problems that work in your favor. Variable clustering and redundancy analysis are two forms of data reduction that I cover in my book and course notes.
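The Hmisc package's `varclus` and `redun` implement variable clustering and redundancy analysis properly; as a dependency-free sketch of the underlying idea, one can hierarchically cluster predictors on the distance 1 − |r|, without ever looking at the outcome (the variables `x1`–`x4` below are made up for illustration):

```r
## Bare-bones sketch of variable clustering, blinded to the outcome
set.seed(2)
x1 <- rnorm(200)
X  <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(200, sd = 0.2),  # redundant with x1
                 x3 = rnorm(200),
                 x4 = rnorm(200))

d  <- as.dist(1 - abs(cor(X)))     # similar variables are "close"
cl <- cutree(hclust(d), h = 0.3)   # cut so that |r| > 0.7 forms a cluster
split(names(cl), cl)
```

Here the redundant pair `x1`/`x2` should land in one cluster, and a single representative (or a summary score) per cluster could then enter the Cox model.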