I am working with a dataset of 1000 individuals, 200 of whom are disease positive. I have run a logistic regression with 25 predictors to identify which variables are significantly predictive overall. Straightforward…
However, I also want to identify which variables account for the greatest amount of variability for males vs. females, and see whether different variables stand out in each group. I considered modeling gender × predictor interaction terms, but that essentially doubles my number of predictors. I proceeded with a forward logistic regression and noticed that, by the last iteration, the model correctly identified a high percentage of the non-disease group (>95%) but was very poor at correctly identifying the disease group. If anything, I would prefer a model that errs toward false positives (for clinical reasons)!
So I played around and took a random sample of 200 from the non-disease group and ran analyses with those individuals and found that the final iteration of the forward LR correctly predicted a high percentage of both groups. Therefore it seemed that using the whole sample yielded a model biased toward the larger group.
In reading through these pages and other sources, it seems that sub-sampling isn't viewed positively regarding LR, but I could not find anything about using it in an iterative, stepwise procedure.
So my questions are:
1) Is sub-sampling acceptable for a stepwise LR when the two levels of the dichotomous outcome are this unbalanced?
2) If not, what other procedure(s) should I consider? (e.g., exact logistic regression?)
Best Answer
The simple answer is No. Subsampling will not help.
I assume that by subsampling you mean drawing a balanced sample, so that the proportion of events changes from 200/1000 to 200/400. That technique is only used in classification models and is (generally) of no use in maximum-likelihood / probability models.
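To see why the balanced sample only appears to help, here is a minimal arithmetic sketch. It assumes all 200 cases are kept, the 200 controls are drawn at random from the 800 available, and the logistic model is correctly specified. Under those assumptions, sampling on the outcome leaves the slopes unchanged in expectation and only shifts the intercept by the log of the control sampling fraction. The function and variable names are hypothetical, chosen for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-x))

# Figures from the question: 800 non-disease individuals, 200 of them kept.
n_controls_total = 800
n_controls_kept = 200
control_sampling_fraction = n_controls_kept / n_controls_total  # 0.25

# Assumption: keeping every case and a random 25% of controls raises the
# fitted intercept by -log(0.25) = log(4); slopes are unaffected in expectation.
intercept_shift = -math.log(control_sampling_fraction)

def to_full_sample_probability(p_balanced):
    """Map a probability from the balanced-sample model to the full-sample scale."""
    return inv_logit(logit(p_balanced) - intercept_shift)

# The default 0.5 cutoff in the balanced sample, expressed on the full-sample scale.
equivalent_cutoff = to_full_sample_probability(0.5)
print(round(equivalent_cutoff, 3))  # 0.2
```

In other words, classifying at 0.5 in the balanced sample is roughly the same as classifying at 0.2 (the disease prevalence) with the full-sample model. The apparent improvement comes from the implied cutoff, not from a better model, and the full-sample fit gets there without discarding 600 observations.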
What the comments are trying to suggest is that the question reveals many other, larger issues, each of which could be a textbook chapter by itself: