Solved – How to do stepwise regression with a binary dependent variable

logistic, stepwise regression

I want to use stepwise regression to reduce the number of variables. My dependent variable is a dummy variable (fraud = 1, non-fraud = 0) and I have 25 predictor variables. How can I do this?

Best Answer

Do not use step-wise regression.

Step-wise regression will almost certainly produce biased results. All statistics produced through step-wise model building carry a nested chain of invisible, unstated "conditional on excluding X" and/or "conditional on including X" statements, with the result that:

  • p-values are biased
  • variances are biased
  • parameter estimates are biased
  • coefficients of determination are biased
  • false predictors are likely to be included
  • true predictors are likely to be excluded
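To see how easily false predictors get in, consider only the first step of forward selection with 25 candidate predictors that are all pure noise. If each is tested at the 0.05 level and the tests are independent, the chance that at least one looks "significant" is 1 − 0.95^25, or about 0.72. The sketch below checks that figure with a small simulation. It is a rough illustration under stated assumptions: a two-sample z-test stands in for the univariate logistic test a step-wise routine would apply, and the sample size, seed, and function names are made up for the example.

```python
import math
import random


def mean_var(values):
    """Return (mean, sample variance, n) for a list of numbers."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, var, n


def two_sample_p(x, y):
    """Two-sided p-value (normal approximation) for a difference in the
    mean of x between the y == 1 group and the y == 0 group."""
    m1, v1, n1 = mean_var([xi for xi, yi in zip(x, y) if yi == 1])
    m0, v0, n0 = mean_var([xi for xi, yi in zip(x, y) if yi == 0])
    z = (m1 - m0) / math.sqrt(v1 / n1 + v0 / n0)
    return math.erfc(abs(z) / math.sqrt(2))


def first_step_false_positive_rate(n_sims=300, n_obs=200, n_pred=25,
                                   alpha=0.05, seed=1):
    """Share of simulated data sets in which the best of n_pred pure-noise
    predictors passes the nominal alpha test, i.e. would be entered at the
    first step of forward selection."""
    rng = random.Random(seed)
    # Outcome is fixed at half ones, half zeros; predictors are independent of it.
    y = [1] * (n_obs // 2) + [0] * (n_obs - n_obs // 2)
    hits = 0
    for _ in range(n_sims):
        best_p = min(
            two_sample_p([rng.gauss(0, 1) for _ in range(n_obs)], y)
            for _ in range(n_pred)
        )
        hits += best_p < alpha
    return hits / n_sims


analytic = 1 - (1 - 0.05) ** 25
simulated = first_step_false_positive_rate()
print(f"analytic chance of a false entry:  {analytic:.3f}")
print(f"simulated chance of a false entry: {simulated:.3f}")
```

The p-value reported for the selected predictor is the smallest of 25, yet it is read as if it came from a single pre-planned test. That is the "conditional on including X" problem in its simplest form.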

What to use instead of step-wise regression

Use substantive theory to guide which predictor variables to include in your model, and report non-significant findings. If needed, you can table only the significant results in the main text of an article or report and include the full model output in an appendix. Step-wise regression, by contrast, is a reliable way to get consistently unreliable model results.
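As a minimal sketch of that workflow, the example below fits one pre-specified logistic model and prints every coefficient, significant or not. Everything in it is an assumption made for illustration: the data are simulated, the predictor names (`amount`, `account_age`, `n_prior_claims`) are hypothetical, and the small Newton-Raphson fitter is only there to keep the example free of dependencies. In practice you would use your statistics package's logistic regression routine.

```python
import math
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression by Newton-Raphson.
    Returns (coefficients, standard errors from the inverse information)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            mu = 1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            w = mu * (1 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    info[j][k] += w * xi[j] * xi[k]
        cov = invert(info)
        beta = [b + sum(cov[j][k] * grad[k] for k in range(p))
                for j, b in enumerate(beta)]
    se = [math.sqrt(cov[j][j]) for j in range(p)]
    return beta, se


# Simulated data: the second predictor has a true coefficient of zero.
names = ["(intercept)", "amount", "account_age", "n_prior_claims"]  # hypothetical
true_beta = [-1.0, 0.8, 0.0, -0.5]
rng = random.Random(0)
X, y = [], []
for _ in range(1000):
    row = [1.0] + [rng.gauss(0, 1) for _ in range(3)]
    prob = 1 / (1 + math.exp(-sum(b * v for b, v in zip(true_beta, row))))
    X.append(row)
    y.append(1 if rng.random() < prob else 0)

beta, se = fit_logistic(X, y)

# Report the whole model, including terms that are not significant.
for name, b, s in zip(names, beta, se):
    p_value = math.erfc(abs(b / s) / math.sqrt(2))
    print(f"{name:15s} coef={b:7.3f}  se={s:6.3f}  p={p_value:.3f}")
```

Because the set of predictors was fixed before looking at the outcome, the printed standard errors and p-values keep their usual interpretation, which is not the case after step-wise selection.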

Some references on the topic
Babyak, M. A. (2004). What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine, 66:411–421.

Flom, P. L. and Cassell, D. L. (2007). Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use.

Henderson, D. A. and Denison, D. R. (1989). Stepwise regression in social and psychological research. Psychological Reports, 64:251–257.

Huberty, C. J. (1989). Problems with stepwise methods—better alternatives. Advances in Social Science Methodology, 1:43–70.

Hurvich, C. M. and Tsai, C.-L. (1990). The impact of model selection on inference in linear regression. The American Statistician, 44(3):214–217.

Malek, M. H., Berger, D. E., and Coburn, J. W. (2007). On the inappropriateness of stepwise regression analysis for model building and testing. European Journal of Applied Physiology, 101(2):263–264.

McIntyre, S. H., Montgomery, D. B., Srinivasan, V., and Weitz, B. A. (1983). Evaluating the statistical significance of models developed by stepwise regression. Journal of Marketing Research, 20(1):1–11.

Pope, P. T. and Webster, J. T. (1972). The use of an $F$-statistic in stepwise regression procedures. Technometrics, 14(2):327–340.

Rencher, A. C. and Pun, F. C. (1980). Inflation of $R^{2}$ in best subset regression. Technometrics, 22(1):49–53.

Romano, J. P. and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237–1282.

Sribney, B., Harrell, F., and Conroy, R. (2011). Problems with stepwise regression.

Steyerberg, E. W., Eijkemans, M. J., and Habbema, J. D. F. (1999). Stepwise selection in small data sets: A simulation study of bias in logistic regression analysis. Journal of Clinical Epidemiology, 52(10):935–942.

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55(4):525–534.

Whittingham, M., Stephens, P., Bradbury, R., and Freckleton, R. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75(5):1182–1189.

Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin, 86(1):168–174.