Solved – Alternatives to stepwise logistic regression with LARGE datasets

large-data, logistic, model-selection, regression, sas

After reviewing related questions on Cross Validated and countless articles and discussions on the inappropriate use of stepwise regression for variable selection, I am still unable to find the answers I am looking for regarding how to build parsimonious binary logistic regression models from datasets with 1,000 or more potential predictor variables.

For some background, I typically work with large datasets (500k or more rows), and my interest is in building binary logistic regression models to predict whether an individual will pay (1) or not pay (0) the bill on a particular account, without using stepwise logistic regression. Currently, stepwise logistic regression is hailed as the "perfect method" by the other statisticians I have worked with, and I would like to change that, as I have witnessed many of its pitfalls firsthand.

I have recently dabbled in PCA-based variable clustering (SAS's PROC VARCLUS) and random forest analyses (R's randomForest package), the latter being especially helpful; however, I am still seeking further direction on how to reduce the number of variables in my binary logistic models without using stepwise selection. Any help (suggested articles or thoughts) is greatly appreciated. Thanks!
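For what it's worth, my random-forest screening looks roughly like the sketch below (written here with Python's scikit-learn rather than R's randomForest, and with a synthetic dataset standing in for the real pay/no-pay data; the variable names and cutoffs are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 5,000 accounts, 100 candidate predictors,
# only 10 of which are actually informative.
X, y = make_classification(n_samples=5000, n_features=100,
                           n_informative=10, random_state=0)

# Fit a forest purely as a screening tool for variable importance.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X, y)

# Keep the 20 predictors with the highest impurity-based importance;
# these become the candidate set for the logistic model.
top20 = np.argsort(rf.feature_importances_)[::-1][:20]
```

The impurity-based importances are cheap but can favor high-cardinality variables, so the cutoff (20 here) is something I pick by eye rather than by any formal rule.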

Best Answer

Tree-based methods (e.g., CART) are widely used for this problem. I would always prefer to estimate a logit model guided by theory, but in situations where there is insufficient theory or the data are not well understood, a tree-based method is, in my experience, always preferable to a logit model: it scales better to large datasets, is much more robust, and deals better with non-linearities and interactions. If your superiors are desperate for a logit model, you can use the variables selected by the tree-based model as the inputs to a logit model.
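That last workflow can be sketched as follows (a minimal illustration using scikit-learn and synthetic data; the depth limit and all names are assumptions, not part of the original answer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 5,000 rows, 200 candidate predictors.
X, y = make_classification(n_samples=5000, n_features=200,
                           n_informative=8, random_state=1)

# A depth-limited CART-style tree; only the variables it actually
# splits on (nonzero importance) are retained.
tree = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X, y)
selected = np.flatnonzero(tree.feature_importances_ > 0)

# Fit the parsimonious logit on the tree-selected variables only.
logit = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
```

A depth-5 tree has at most 31 internal splits, so this caps the logit at a few dozen predictors regardless of how many candidates you start with; deeper trees admit more variables.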