Solved – To select variables or not in logistic regression

feature-selection, logistic, prediction, regression-strategies

I am trying to find predictors for an outcome. I was taught to perform univariate analyses and enter the significant variables into a multivariate logistic regression model. I then remove variables one by one, dropping those with p-values > 0.05, to obtain the final model.
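For concreteness, here is a minimal sketch of the screening-plus-backward-elimination procedure described above. Everything in it is an illustrative assumption: the data are synthetic, and the 0.05 thresholds and helper function are just one way to code the recipe, not a recommendation.

```python
# Sketch of: (1) univariate screening, (2) backward elimination by p-value.
# Synthetic data; thresholds and names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 6)),
                 columns=[f"x{i}" for i in range(6)])
# The outcome truly depends on x0 and x1 only; the rest are noise.
logit = 1.0 * X["x0"] - 1.5 * X["x1"]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Step 1: univariate screening -- keep predictors with p < 0.05.
def univariate_p(col):
    m = sm.Logit(y, sm.add_constant(X[[col]])).fit(disp=0)
    return m.pvalues[col]

kept = [c for c in X.columns if univariate_p(c) < 0.05]

# Step 2: backward elimination -- while any multivariate p-value
# exceeds 0.05, drop the predictor with the largest p-value.
while kept:
    m = sm.Logit(y, sm.add_constant(X[kept])).fit(disp=0)
    pvals = m.pvalues.drop("const")
    if pvals.max() < 0.05:
        break
    kept.remove(pvals.idxmax())

print("final model terms:", kept)
```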

I saw from some papers that there is another approach: they do not remove any variables from the multivariate model, and instead adjust for all of them.

The first approach may not adjust for some potential confounders, but you get a model with fewer variables, all of which are significant. The second approach adjusts for everything, which could be quite a long list. Are there any other important advantages or disadvantages that I should be aware of between the two approaches?

Best Answer

Approaches that naively select model terms based on p-value or AIC cut-offs (whether via stepwise or other selection in a multivariate model, or by screening many univariate models) lead to highly problematic fits: they may describe the particular dataset well, but are otherwise of little use. Models built this way tend to flag irrelevant variables as relevant while missing truly relevant ones (assuming the model used is some reasonable approximation of nature, in which some variables are relevant and some are not), and they have poor predictive properties on new datasets.

Such approaches are nevertheless still widely used, and one can even occasionally get such work published in well-respected journals, but they are quite thoroughly discredited in the statistical community. There are many more appropriate approaches, e.g. bootstrapping the naive model-building procedure, cross-validation, random forests, model averaging, variable-selection priors, etc., that should be used instead.
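As one concrete instance of the alternatives listed above, here is a hedged sketch of L1-penalised logistic regression with the penalty strength chosen by cross-validation, using scikit-learn. The data are synthetic and the grid size, fold count, and solver are illustrative assumptions, not a prescription; the answer's other suggestions (bootstrapping, random forests, model averaging, selection priors) would be coded quite differently.

```python
# Cross-validated L1-penalised logistic regression: the penalty shrinks
# irrelevant coefficients toward exactly zero, so "selection" happens
# inside a procedure whose tuning is validated out of sample.
# Synthetic data; all parameter choices are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -1.5, 0.8]          # only three truly relevant predictors
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))

# 10-fold CV over a grid of 20 penalty strengths; "saga" supports L1.
model = LogisticRegressionCV(Cs=20, cv=10, penalty="l1",
                             solver="saga", max_iter=5000).fit(X, y)

selected = np.flatnonzero(model.coef_.ravel())
print("non-zero coefficients at indices:", selected)
```

Note that even with a procedure like this, any claim about *which* variables matter should itself be checked for stability, e.g. by repeating the whole fit on bootstrap resamples.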