Solved – Backward selection for Cox model using R

cox-model, r, stepwise-regression

I want to perform an exploratory Cox regression analysis of medical data using R. I am practicing with the pbc data from the survival package.

Would you recommend performing a multivariable analysis with backward selection? Are there any summary data / tables I should create for the covariates before modelling? Are there any model diagnostics I should perform? And what would be the consequences of doing this?

I would be very grateful for your help and for examples using R; easy-to-understand literature recommendations (papers, books, and so on) would also be welcome.


To follow up on my earlier question: I understand that stepwise backward regression leads to inflated coefficients, deflated p-values, and inflated model fit statistics. However, this approach is very common in medical reports. Would it be possible to conclude that a covariate is independently associated with an outcome, irrespective of the above-mentioned drawbacks? And if so, how reliable would that conclusion be?

And again, although I am a little afraid to ask: what would be the best way to perform such an analysis in R?

Best Answer

I would recommend not performing stepwise model building, unless you are looking for biased (inflated) coefficients, biased (deflated) p-values, and inflated model fit statistics.
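The deflated p-values can be seen in a small simulation. The following is a minimal sketch, not part of the original answer: it assumes a simple p-value-based backward elimination (the helper `backward_p`, the simulated data, and the 0.05 threshold are all hypothetical choices made for illustration) and applies it to data in which no covariate has any relation to survival. Every covariate that survives the elimination is therefore a false positive, yet it is reported with p ≤ 0.05 in the final model.

```r
library(survival)

set.seed(1)

# Simulate one data set in which NO covariate is related to survival.
simulate_null <- function(n = 200, p = 20) {
  d <- as.data.frame(matrix(rnorm(n * p), nrow = n))  # columns V1 ... Vp
  d$time   <- rexp(n)            # event times, independent of all covariates
  d$status <- rbinom(n, 1, 0.7)  # event indicator, also independent
  d
}

# Hypothetical helper: drop the covariate with the largest Wald p-value
# until every remaining covariate has p <= alpha.
backward_p <- function(data, vars, alpha = 0.05) {
  while (length(vars) > 0) {
    f   <- as.formula(paste("Surv(time, status) ~", paste(vars, collapse = " + ")))
    tab <- summary(coxph(f, data = data))$coefficients
    pv  <- setNames(tab[, "Pr(>|z|)"], rownames(tab))
    if (max(pv) <= alpha) break
    vars <- setdiff(vars, names(which.max(pv)))
  }
  vars
}

# Repeat on several pure-noise data sets and count the retained covariates.
n_sim  <- 30
n_kept <- integer(n_sim)
for (i in seq_len(n_sim)) {
  n_kept[i] <- length(backward_p(simulate_null(), vars = paste0("V", 1:20)))
}

# Share of pure-noise data sets in which at least one covariate looks "significant"
mean(n_kept > 0)
```

If the selection procedure were harmless, that share would be close to the nominal 5%; the size of the gap depends on the number of candidate covariates and the sample size chosen above.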

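The fundamental problem is that every inference drawn from one's final model carries a typically invisible and usually uninterpretable series of "conditional upon all these other choices based on other variables in some order" statements. The standard errors, p-values, and confidence intervals printed for the final model take no account of that selection process.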
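Since the question asks for an R example with the pbc data, here is a minimal sketch of one alternative to data-driven selection: fix the covariates in advance, describe them, fit a single Cox model, and check the proportional hazards assumption. The five covariates and the log transformations are assumptions chosen here only for illustration; in a real analysis they would come from subject-matter knowledge, decided before looking at the outcome.

```r
library(survival)

# pbc ships with the survival package.
# status: 0 = censored, 1 = transplant, 2 = dead
d <- subset(pbc, !is.na(trt))          # keep the randomised patients only
d$death <- as.integer(d$status == 2)   # treat death as the event of interest

# Descriptive summary of the candidate covariates before any modelling
summary(d[, c("age", "edema", "bili", "albumin", "protime")])

# Pre-specified model: no covariate is added or removed based on its p-value
fit <- coxph(Surv(time, death) ~ age + edema + log(bili) + albumin + log(protime),
             data = d)
summary(fit)

# Diagnostic: test of proportional hazards based on scaled Schoenfeld residuals
zph <- cox.zph(fit)
print(zph)
```

`plot(zph)` draws the smoothed residuals for each term, which helps show where and how a covariate effect changes over time when the test flags a problem.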


References
Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine, 66:411–421.

Henderson, D. A. and Denison, D. R. (1989). Stepwise regression in social and psychological research. Psychological Reports, 64:251–257.

Huberty, C. J. (1989). Problems with stepwise methods—better alternatives. Advances in Social Science Methodology, 1:43–70.

Hurvich, C. M. and Tsai, C.-L. (1990). The impact of model selection on inference in linear regression. The American Statistician, 44(3):214–217.

Lovell, M. C. (1983). Data mining. The Review of Economics and Statistics, 65(1):1–12.

Malek, M. H., Berger, D. E., and Coburn, J. W. (2007). On the inappropriateness of stepwise regression analysis for model building and testing. European Journal of Applied Physiology, 101(2):263–264.

McIntyre, S. H., Montgomery, D. B., Srinivasan, V., and Weitz, B. A. (1983). Evaluating the statistical significance of models developed by stepwise regression. Journal of Marketing Research, 20(1):1–11.

Pope, P. T. and Webster, J. T. (1972). The use of an $F$-statistic in stepwise regression procedures. Technometrics, 14(2):327–340.

Rencher, A. C. and Pun, F. C. (1980). Inflation of $R^2$ in best subset regression. Technometrics, 22(1):49–53.

Romano, J. P. and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237–1282.

Sribney, B., Harrell, F., and Conroy, R. (2011). Problems with stepwise regression.

Steyerberg, E. W., Eijkemans, M. J., and Habbema, J. D. F. (1999). Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. Journal of Clinical Epidemiology, 52(10):935–942.

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55(4):525–534.

Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin, 86(1):168–174.