Solved – Interpretation of Confidence Intervals using Cox proportional hazards regression model (coxph)

confidence intervalcox-modelrsurvival

I am using the coxph function to model a Cox regression.
By using stepwise BIC selection I obtained an model with 6 variables.
One of the variables I had to transform using the logarithm to make it fulfill the proportional hazard criterion. All Variables are marked as high significant with very low p-values- What I am now confused about is the fact, that the confidence intervals of two variables are very high. Therefore I wonder how "useful" those variables are! Since they are selected by the algorithm I have no doubt that they are useful in a mathematical way, but what is the meaning of a variable whose 95% confidence interval is spanning over a wide range.

Cox Regression Result

As you can see the variables log(F) and especially E have a very high confidence interval. How can E be so important for the model (there have been more than 70 variables to choose from) and still be so "uncertainly" determined.
I hope I was able to formulate my problem in an understandable way.

Thanks in advance!
Mark

Best Answer

Much has been written on this site about the huge harm to statistical inference caused by stepwise variable selection. In particular, variable selection ruins the confidence intervals.

Transformations of predictors does not relate directly to the proportional hazards assumption; it relates to the regression shape assumption, i.e., how when varying $X$ the log hazard function changes for fixed time $t$. And it is doubtful that the log transformation is fully adequate, which is why analysts frequently use regression splines.

To get proper confidence intervals in the presence of damaging variable selection you could use the bootstrap, inserting zeros for non-selected regression coefficients. But I would try to start the project over using a pre-specified model based on subject matter expertise or using data reduction masked to the dependent variable. My book Regression Modeling Strategies and its course notes at http://biostat.mc.vanderbilt.edu/rms go into all this in great detail. I tend to pre-specify the number of knots in regression splines but to spline almost all continuous predictors, subject to having enough events per candidate variable.