Cox Model – Comparing Cox Proportional Hazards Models: Variable Selection Analysis

cox-model, r

I am using a Cox proportional hazards model to run a survival analysis in R on a number of distinct, non-nested covariates such as Age, Blood Type, Cancer, etc.:

 A, B, C, D, E    

When I fit the full model and test the omnibus null hypothesis:

surv ~ A + B + C + D    

The effects of all of the covariates are non-significant because relatively few subjects have measurements for every covariate. However, when I fit separate Cox models on single covariates or on other combinations of covariates:

surv ~ A    
surv ~ A + C
surv ~ B + D

I see significant effects because the sample is larger (i.e. the model discards fewer observations).
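The shrinking sample can be checked directly by comparing how many rows each fit actually used. The following is a minimal sketch on simulated data, assuming the `survival` package and its default handling of missing values (incomplete rows are dropped); the data frame `d` and its columns are hypothetical stand-ins for the real covariates.

```r
library(survival)

set.seed(1)
n <- 200
# Hypothetical data: four covariates, each with 60 values set to missing
d <- data.frame(
  time   = rexp(n),
  status = rbinom(n, 1, 0.7),
  A = rnorm(n), B = rnorm(n), C = rnorm(n), D = rnorm(n)
)
for (v in c("A", "B", "C", "D")) d[sample(n, 60), v] <- NA

fit_full <- coxph(Surv(time, status) ~ A + B + C + D, data = d)
fit_A    <- coxph(Surv(time, status) ~ A, data = d)

# $n is the number of rows kept after incomplete cases were dropped
used <- c(full = fit_full$n, A_only = fit_A$n)
used
```

Because the two fits use different subjects, their likelihoods and test statistics are not directly comparable.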

What I'm having difficulty understanding is how to do the following:

  • Comparing the different Cox models for the best fit, i.e. is surv ~ A + B + D a better model than surv ~ A + C? Should I be comparing the likelihood ratio, Wald, or log-rank (score) statistics?
  • Is it possible to run every possible combination of covariates to determine the best model? I have about 15 covariates.
  • More broadly, is this tactic the best approach to optimizing for both significant covariates and overall model "cost"? I will be attaching a cost to each distinct Cox model, i.e. using covariates A + B + C in the model costs \$100, while using covariates A + B costs \$75 and using only covariate A costs \$10. I'd like to look at the cost for each combination of covariates vs. the accuracy of each Cox model.
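For a sense of scale on the second point, the number of candidate models can be counted, and small cases enumerated, in base R. This is only a sketch of the bookkeeping; the four-variable set `vars` is a hypothetical example, and being able to enumerate the models says nothing about whether searching them is statistically sound.

```r
# Number of non-empty subsets of 15 covariates (equals 2^15 - 1)
p <- 15
n_models <- sum(choose(p, 1:p))
n_models

# Enumerate every subset for a small, hypothetical set of covariates
vars <- c("A", "B", "C", "D")
subsets <- unlist(
  lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
  recursive = FALSE
)
# One formula string per subset
formulas <- sapply(subsets, function(s) paste("surv ~", paste(s, collapse = " + ")))
length(formulas)
```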

Thanks very much for your help!

Best Answer

In general there is no reason to do variable selection. The model uncertainty and bias that result from it are problematic. Insignificant variables are not tragic, and the data are incapable of telling you which variables are "really" important. But if you have true costs of measuring variables, you can fit a well-defined sequence of models by adding variables in ascending order of cost, and stop when you have the best model for the money. There is little model uncertainty when using an a priori ordering of variables.
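A minimal sketch of such a cost-ordered sequence, using simulated data and the `survival` package, might look like the following. Several things here are assumptions made for illustration: the per-variable costs (chosen so that the totals match the \$10 / \$75 / \$100 figures in the question, and assumed to be additive), the use of the concordance index as a stand-in for "accuracy", and the restriction to one common set of complete cases so that every model is fit to the same subjects.

```r
library(survival)

set.seed(2)
n <- 300
d <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n))
# Simulated survival times; the hazard depends on A and B (demo assumption)
d$time   <- rexp(n, rate = exp(0.5 * d$A + 0.3 * d$B))
d$status <- rbinom(n, 1, 0.8)

# Hypothetical per-variable costs; variables enter in ascending order of cost
cost <- c(A = 10, B = 65, C = 25)
ord  <- names(sort(cost))

# Keep only rows complete on every variable, so all fits use the same subjects
cc <- d[complete.cases(d[, c("time", "status", ord)]), ]

# Fit the nested sequence: cheapest variable first, then add the next cheapest
fits <- lapply(seq_along(ord), function(k) {
  f <- as.formula(paste("Surv(time, status) ~", paste(ord[1:k], collapse = " + ")))
  coxph(f, data = cc)
})

loglik <- sapply(fits, function(f) f$loglik[2])  # maximized partial log-likelihood
res <- data.frame(
  model       = sapply(seq_along(ord), function(k) paste(ord[1:k], collapse = " + ")),
  cost        = unname(cumsum(cost[ord])),
  loglik      = loglik,
  AIC         = sapply(fits, AIC),
  concordance = sapply(fits, function(f) unname(summary(f)$concordance[1])),
  # Likelihood-ratio test of each step against the previous model (1 df per step)
  LR_p        = c(NA, pchisq(2 * diff(loglik), df = 1, lower.tail = FALSE))
)
res
```

Because the models are nested and fit to identical rows, the likelihood-ratio comparison between consecutive rows is valid, and the cost column can be read against the gain in fit to decide where to stop.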