Solved – Building Cox Proportional Hazards Models and Getting Accurate Confidence Intervals

cox-model, feature-selection, model-selection, regression-strategies, survival

I have 4000 records of survival data and 100 potential predictors/variables. My aim is to identify the most important variables describing these data.
Since complete data records are needed, I excluded the following (a rough sketch of these steps is shown after the list):

  • all variables with more than 20% of missing values
  • all variables with high VIF values (>10) to avoid collinearity
  • all variables whose single-predictor Cox model did not converge
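
Roughly, the screening amounted to something like the sketch below (not my actual code; `dat`, the `time`/`status` columns, and the helper function are placeholders, only the 20% and VIF > 10 cut-offs are from my description):

```r
library(survival)

## 1. Drop predictors with more than 20% missing values
miss_frac <- colMeans(is.na(dat))
keep <- setdiff(names(dat)[miss_frac <= 0.20], c("time", "status"))

## 2. Drop predictors with VIF > 10, where each VIF is 1 / (1 - R^2) from a
##    linear regression of that predictor on all the others (complete cases)
vif_of <- function(v, others, d) {
  r2 <- summary(lm(as.formula(paste(v, "~", paste(others, collapse = " + "))),
                   data = d))$r.squared
  1 / (1 - r2)
}
cc   <- na.omit(dat[keep])
vifs <- sapply(keep, function(v) vif_of(v, setdiff(keep, v), cc))
keep <- keep[vifs <= 10]

## 3. Drop predictors whose single-predictor Cox model fails to converge
converged <- sapply(keep, function(v) {
  f <- as.formula(paste("Surv(time, status) ~", v))
  !is.null(tryCatch(coxph(f, data = dat),
                    error = function(e) NULL, warning = function(w) NULL))
})
keep <- keep[converged]
```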

For variables that did not satisfy the proportional hazards assumption, I tried transforming them toward a more normally distributed form, since I was told that this makes the assumption more likely to hold. As you can see in the table further down, the model includes a variable log(F), which is such a transformed variable. After the transformation, my checking function cox.zph no longer "complained" about this variable.
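
The check I used looks roughly like this (a sketch with placeholder names; E and F are the variables that appear in the table below):

```r
library(survival)

## Cox model with the transformed predictor, then a test of the
## proportional hazards assumption via scaled Schoenfeld residuals
fit <- coxph(Surv(time, status) ~ E + log(F), data = dat)
cox.zph(fit)        # per-variable and global PH tests
plot(cox.zph(fit))  # residual plots to inspect trends over time
```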

To reduce the remaining 70 variables to the most important ones, I applied stepwise BIC selection to my Cox regression fitted with coxph.
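
In code, the selection was roughly the following (again a sketch, not my actual script; `dat` and its columns are placeholders, and it assumes no missing values remain among the candidates):

```r
library(survival)
library(MASS)

full <- coxph(Surv(time, status) ~ ., data = dat)   # all remaining candidates

## Stepwise selection with a BIC-type penalty; using the number of events
## rather than the number of rows in the log() penalty is one common choice
n_events <- sum(dat$status)
sel <- stepAIC(full, direction = "both", k = log(n_events), trace = FALSE)
summary(sel)   # the table below comes from output like this
```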
As can be seen below, the variables E and log(F) are highly significant, but E in particular has a very large confidence interval, which makes me wonder how such a variable can still be significant in the model.
[Table: coxph model output with coefficients, confidence intervals, and p-values for the six selected variables, including E and log(F)]

I was made aware that the stepwise selection I performed might have harmed the confidence intervals, although I am not sure what the connection is, since I could have selected these 6 variables from the very beginning.

I am trying my best to understand the mathematical foundations of proper model construction, but now I am stuck.

What would be a proper strategy to narrow the 70 variables down to the most important ones in the final Cox model?

I read that variable selection is always problematic, but I cannot run a full model (with all variables) because of the missing values and my fear of overfitting.

I hope I was able to formulate my problem in an understandable way and am thankful for any advice.

Thanks in advance!
Mark

Best Answer

My book Regression Modeling Strategies, 2nd Edition has detailed strategies and case studies for model building and validation. Detailed course notes going along with the book may be found at http://biostat.mc.vanderbilt.edu/RmS#Materials

Your original strategy creates several difficulties, including:

  1. Removal of predictors that have missing values, as opposed to using multiple imputation (see the sketch after this list)
  2. Usage of stepwise variable selection which ruins confidence intervals even more than it ruins point estimates (regression coefficients)
  3. Conflating predictor transformation with predictor $\times$ time interaction (non-proportional hazards), and assuming that the distribution of predictors has something to do with the quality of their fit in the model (including whether the proportional hazards assumption is satisfied).
  4. Arbitrary removal of collinear predictors instead of getting better predictions by pre-combining collinear predictors using methods such as principal components
  5. Assuming that the only transformations of predictors that fit the data are the identity and log transformations
  6. Using log transformations as opposed to more general and better fitting regression splines (which also allow zeros and negative values in predictors)
  7. Having far too many separate candidate variables for fitting a full model or for using stepwise variable selection without penalization. You didn't state a key number which is the number of events. You need about 15 events per candidate predictor for a model to be reliable.
  8. Univariate analysis should not be used to select variables for modeling. This results in huge biases and inefficiencies.
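
As a rough illustration of points 1 and 6, the rms and Hmisc packages that accompany the book support multiple imputation and restricted cubic splines directly. The sketch below uses made-up variable names and default settings; it shows the shape of the approach, not a finished analysis:

```r
library(rms)   # loads Hmisc and survival as well

## Multiply impute missing predictor values instead of discarding variables
imp <- aregImpute(~ time + status + A + B + E + F, data = dat, n.impute = 10)

## Fit the Cox model on each completed data set and combine the results;
## rcs() fits restricted cubic splines instead of assuming identity or log forms
fit <- fit.mult.impute(Surv(time, status) ~ rcs(A, 4) + rcs(B, 4) + rcs(E, 4) + rcs(F, 4),
                       cph, imp, data = dat, x = TRUE, y = TRUE, surv = TRUE)
anova(fit)   # pooled Wald tests that account for all spline terms
```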

Stepwise variable selection, whether based on $R^2$, AIC, BIC, $P$-values, or $C_p$, is not a valid solution to your problem unless you incorporate shrinkage while doing variable selection (and that would only solve $\frac{1}{4}$ of the problem anyway).
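
If you do want selection, one way to build the shrinkage in is an L1-penalized (lasso) Cox model. The sketch below uses the glmnet package with placeholder variable names and assumes complete cases; as noted above, this still addresses only part of the problem:

```r
library(glmnet)
library(survival)

x <- model.matrix(~ . - time - status, data = dat)[, -1]  # predictor matrix
y <- Surv(dat$time, dat$status)  # recent glmnet versions accept a Surv outcome directly

cvfit <- cv.glmnet(x, y, family = "cox")   # cross-validation to choose the penalty
coef(cvfit, s = "lambda.1se")              # shrunken coefficients; many exactly zero
```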

As you reformulate the approach without doing variable selection or deletion of sometimes-missing predictors, consider making data reduction and multiple imputation key components of your strategy. Data reduction reduces the dimensionality of the predictor space in a way that is masked to the dependent variable, so it does not create biases or multiplicity problems that spuriously work in your favor. Variable clustering and redundancy analysis are two forms of data reduction that I cover in my book and course notes.
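
Both tools are available in the Hmisc package. The following is only a sketch with placeholder predictor names; note that the outcome is deliberately not used:

```r
library(Hmisc)

## Variable clustering: group predictors by their pairwise similarity so that a
## cluster can be summarized by one score or one representative variable
vc <- varclus(~ A + B + C + D + E + F, data = dat)
plot(vc)

## Redundancy analysis: flag predictors that are well predicted by the others
redun(~ A + B + C + D + E + F, data = dat, r2 = 0.9)
```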