Logistic Regression – Insights into Stepwise Logistic Regression and Sampling

logistic, spss, stepwise-regression

I am fitting a stepwise logistic regression on a set of data in SPSS. In the procedure, I am fitting my model to a random subset that is approx. 60% of the total sample, which is about 330 cases.

What I find interesting is that every time I re-sample my data, I am getting different variables popping in and out in the final model. A few predictors are always present in the final model, but others pop in and out depending on the sample.

My question is this. What is the best way to handle this? I was hoping to see the convergence of predictor variables, but that isn't the case. Some models make much more intuitive sense from an operational view (and would be easier to explain to the decision makers), and others fit the data slightly better.

In short, since variables are shuffling around, how would you recommend dealing with my situation?

Many thanks in advance.

Best Answer

If you're going to use a stepwise procedure, don't resample. Create one random subsample once and for all. Perform your analysis on it. Validate the results against the held-out data. It's likely most of the "significant" variables will turn out not to be significant.
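As a rough illustration of the "split once, then validate" idea, here is a minimal Python sketch (the question uses SPSS, so this only stands in for the workflow). Everything in it is an assumption made for the example: the data are simulated, the helper names `split_once` and `wald_z` are hypothetical, and the logistic fit is a small hand-written Newton-Raphson routine rather than any particular package's implementation. The point is only that the split is made a single time with a fixed seed, and that the statistics seen on the training part are then compared with the same statistics on the held-out part.

```python
import numpy as np

def split_once(n_rows, train_fraction=0.6, seed=12345):
    """Return (train_idx, holdout_idx) from ONE fixed-seed shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(train_fraction * n_rows))
    return order[:n_train], order[n_train:]

def wald_z(X, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson and return the
    Wald z statistic (coefficient / standard error) of each predictor."""
    design = np.column_stack([np.ones(len(y)), X])  # add intercept column
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        info = design.T @ (design * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, design.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    # Standard errors from the inverse information matrix at the final fit.
    p = 1.0 / (1.0 + np.exp(-design @ beta))
    info = design.T @ (design * (p * (1.0 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return (beta / se)[1:]  # drop the intercept

# Simulated stand-in data: 330 cases, 6 predictors, only column 0 matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(330, 6))
y = (rng.random(330) < 1.0 / (1.0 + np.exp(-1.5 * X[:, 0]))).astype(float)

train_idx, holdout_idx = split_once(len(y))
z_train = wald_z(X[train_idx], y[train_idx])
z_holdout = wald_z(X[holdout_idx], y[holdout_idx])

for j, (zt, zh) in enumerate(zip(z_train, z_holdout)):
    print(f"predictor {j}: z on training = {zt:6.2f}, z on held-out = {zh:6.2f}")
```

A predictor whose statistic looks large on the training part but shrinks toward zero on the held-out part is the kind of variable the answer warns about.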

(Edit 12/2015: You can indeed go beyond such a simple approach by resampling, repeating the stepwise procedure, and re-validating: this will lead you into a form of cross-validation. But in such a case more sophisticated methods of variable selection, such as ridge regression, the Lasso, and the Elastic Net are likely preferable to stepwise regression.)
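To make the resampling idea concrete, the sketch below repeatedly draws a 60% subsample, fits an L1-penalised (Lasso) logistic regression to each, and records how often each predictor ends up with a non-zero coefficient. This is only an illustration under stated assumptions: the data are simulated, the function names are hypothetical, the penalty `lam=0.05` is an arbitrary choice rather than a tuned value, and the solver is a simple proximal-gradient loop written for the example, not a library routine.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (the L1 proximal step)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_logistic(X, y, lam, n_iter=500):
    """Minimise mean log-loss + lam * ||beta||_1 by proximal gradient
    descent. Predictors are standardised; the intercept is not penalised.
    Returns the coefficients of the standardised predictors."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    A = np.column_stack([np.ones(n), Xs])
    # Step size 1/L, where L = ||A||_2^2 / (4n) bounds the gradient's
    # Lipschitz constant for the mean logistic loss.
    step = 4.0 * n / np.linalg.norm(A, 2) ** 2
    coef = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-A @ coef))
        coef = coef - step * (A.T @ (prob - y) / n)
        coef[1:] = soft_threshold(coef[1:], step * lam)
    return coef[1:]

def selection_frequency(X, y, lam, n_resamples=50, fraction=0.6, seed=0):
    """Share of subsamples in which each predictor gets a non-zero coefficient."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = int(round(fraction * n))
    counts = np.zeros(X.shape[1])
    for _ in range(n_resamples):
        idx = rng.choice(n, size=m, replace=False)
        counts += lasso_logistic(X[idx], y[idx], lam) != 0
    return counts / n_resamples

# Simulated stand-in data: 550 cases (60% is 330), 8 predictors,
# of which only columns 0 and 1 affect the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(550, 8))
eta = 1.5 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(550) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

freq = selection_frequency(X, y, lam=0.05)
for j, f in enumerate(freq):
    print(f"predictor {j}: selected in {f:.0%} of subsamples")
```

Predictors that are selected in nearly every subsample correspond to the "always present" variables described in the question; those that appear only occasionally are the ones popping in and out.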

Focus on the variables that make sense, not those that fit the data a little better. If you have more than a handful of variables for 330 records, you're at great risk of overfitting in the first place. Consider using fairly severe entering and leaving criteria for the stepwise regression. Base them on AIC or $C_p$ instead of on thresholds for $F$ tests or $t$ tests.
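For reference, AIC is $2k - 2\ln\hat{L}$, where $k$ is the number of fitted parameters and $\hat{L}$ is the maximised likelihood, so each added variable has to improve the log-likelihood by more than 1 to lower it. A minimal Python sketch of AIC-driven selection follows. It is an illustration only: the data are simulated, the function names are hypothetical, the fit is a hand-written Newton-Raphson loop, and for brevity it does forward steps only (no removal step), so it is a simplified stand-in for a full stepwise procedure.

```python
import numpy as np

def log_likelihood(design, y, n_iter=50):
    """Maximised Bernoulli log-likelihood of a logistic model,
    fitted by Newton-Raphson on the given design matrix."""
    k = design.shape[1]
    beta = np.zeros(k)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        # Tiny ridge term only guards against a numerically singular matrix.
        hess = design.T @ (design * (p * (1.0 - p))[:, None]) + 1e-8 * np.eye(k)
        step = np.linalg.solve(hess, design.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    eta = design @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def aic(X, y, columns):
    """AIC = 2k - 2 log L for the model with an intercept plus `columns`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    return 2 * design.shape[1] - 2 * log_likelihood(design, y)

def forward_select_aic(X, y):
    """Add, one at a time, the predictor that lowers AIC the most;
    stop as soon as no remaining predictor lowers it."""
    selected = []
    best = aic(X, y, selected)
    remaining = list(range(X.shape[1]))
    while remaining:
        cand_aic, cand = min((aic(X, y, selected + [j]), j) for j in remaining)
        if cand_aic >= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_aic
    return selected, best

# Simulated stand-in data: 330 cases, 8 predictors, columns 0 and 1 matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(330, 8))
eta = 1.5 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(330) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

selected, final_aic = forward_select_aic(X, y)
print("selected columns:", selected)
print("AIC of selected model:", round(final_aic, 2))
print("AIC of intercept-only model:", round(aic(X, y, []), 2))
```

AIC can still admit an occasional noise variable, which is one more reason to check any selected model against held-out data.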

(I presume you have already carried out the analysis and exploration to identify appropriate re-expressions of the independent variables, that you have identified likely interactions, and that you have established that there really is an approximately linear relationship between the logit of the dependent variable and the regressors. If not, do this essential preliminary work and only then return to the stepwise regression.)

Be cautious about following generic advice like I just gave, by the way :-). Your approach should depend on the purpose of the analysis (prediction? extrapolation? scientific understanding? decision making?) as well as the nature of the data, the number of variables, etc.