Logistic Regression – Handling Too Many Predictors

Tags: logistic, logit, regression

I'm using logistic regression to look for associations between independent variables and the outcome (i.e., not to create a classifier). I have many variables and a sample size of 391 (58 with y=1 and the rest with y=0). By the rule of 10 events per variable, I am limited to 5 independent variables.

Is it possible to test associations by creating two models, one on a subset of the independent variables (IV1, IV2, IV3, IV4, and IV5) and the other on the remaining ones (IV6, IV7, and IV8)? I would then test each model separately for overfitting, etc.

My result could, theoretically, be that IV3 has a p-value of 0.002 and IV8 has a p-value of 0.01, and I would conclude (assuming the goodness-of-fit tests are acceptable) that there is an association between the outcome and both IV3 and IV8. Does that make sense?

Best Answer

Your approach gives up one of the advantages of multiple regression: accounting for the combined influences of all the predictors at once. It's thus effectively throwing away information, which is seldom useful.

One way to deal with too many predictors is to use subject-matter knowledge, or the observed relations among the predictors (not considering the outcomes), to combine several related predictors into a single predictor.
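As a minimal sketch of the combining step, assuming a data frame dat with hypothetical predictors IV1–IV5 and outcome y, you could replace a group of related predictors with their first principal component, computed from the predictor values alone:

```r
# Combine related predictors into one via their first principal component.
# Only the predictor values are used here -- the outcome never enters.
related <- dat[, c("IV1", "IV2", "IV3")]   # predictors known to be related
pc <- prcomp(related, center = TRUE, scale. = TRUE)
dat$IV123 <- pc$x[, 1]                     # single combined predictor

# The combined predictor then occupies one slot in the regression:
fit <- glm(y ~ IV123 + IV4 + IV5, data = dat, family = binomial)
summary(fit)
```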

Another way was suggested in the comment by @user777: use LASSO, ridge regression, or elastic net, which impose a penalty on regression coefficients that guards against overfitting. (The rule of thumb of 10 events per variable was based on non-penalized analyses.) These methods provide principled ways to build models even if you have more predictor variables than cases.
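For illustration, here is a minimal sketch with the glmnet package, assuming a numeric predictor matrix x and a 0/1 outcome vector y:

```r
library(glmnet)

# Penalized logistic regression; alpha = 0 is ridge, alpha = 1 is LASSO,
# intermediate values give the elastic net. The penalty strength (lambda)
# is chosen by cross-validation.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)

# Shrunken coefficients at the cross-validated penalty
coef(cv_fit, s = "lambda.min")
```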

Note that the best-subset suggestion in another comment doesn't get around the overfitting issue, and the variables selected would be highly dependent on your particular data sample. Try repeating best-subset analysis on multiple bootstrap samples to see the problems.
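A rough way to see this for yourself is sketched below, with stepwise AIC selection via step() standing in for an exhaustive best-subset search (a data frame dat with outcome y is assumed):

```r
set.seed(42)
selected <- replicate(100, {
  # Resample the data with replacement and rerun variable selection
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  fit  <- step(glm(y ~ ., data = boot, family = binomial), trace = 0)
  # Record which predictors survived (intercept dropped)
  paste(sort(names(coef(fit))[-1]), collapse = " + ")
})
# Many different subsets across resamples => selection is data-dependent
sort(table(selected), decreasing = TRUE)
```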

If your interest is prediction and you only have 8 predictor variables, ridge regression will probably work well on your data.

Added in response to comments:

The paper linked from a comment, on assessing multivariable logistic regression models, rightly emphasizes the proper selection of predictor variables as a major criterion. Using subject-matter knowledge for selecting or combining variables should be a top priority. You might, for example, be able to combine categories in your categorical variable, or omit predictors that related studies have shown not to be closely related to the outcome.
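Collapsing categories is straightforward in base R; a sketch, with the factor name and levels purely hypothetical:

```r
# Merge two sparse or substantively similar levels into one
dat$IV_cat <- factor(ifelse(dat$IV_cat %in% c("D", "E"),
                            "D_or_E", as.character(dat$IV_cat)))
table(dat$IV_cat)   # check the new level counts
```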

That paper's sole focus on 10 events per variable to prevent overfitting, however, is inadequate in two ways. First, as a rule of thumb, 15 events per variable may be a better choice than 10. Second, not noted in that paper, methods like LASSO and ridge regression provide another well-established way to prevent overfitting, by shrinking the magnitudes of the coefficients to less than those that would be provided by standard logistic regression. See for example An Introduction to Statistical Learning for background on these and other approaches.

The idea of breaking your analysis into two parts (the 5-level categorical variable and then all the other variables separately) doesn't really accomplish much. Whatever you think you might gain from having about 15-20 events per variable in each of the two separate analyses would be lost to the need to correct for multiple hypothesis testing, and to your inability to account for the levels of the categorical variable when evaluating the other variables (and vice versa). And you would still need to evaluate overfitting.
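For instance, a Bonferroni correction of the two hypothetical p-values from the question, counting all 8 tests performed across the two models:

```r
# Adjust the two reported p-values for the 8 tests actually performed
p.adjust(c(IV3 = 0.002, IV8 = 0.01), method = "bonferroni", n = 8)
# IV3 becomes 0.016, IV8 becomes 0.08 -- no longer significant at 0.05
```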

With respect to "investigating associations" versus predictive modeling, consider what Frank Harrell has to say in Regression Modeling Strategies, second edition, page 3:

Thus when one develops a reasonable multivariable predictive model, hypothesis testing and estimation of effects are byproducts of the fitted model. So predictive modeling is often desirable even when prediction is not the main goal.

Harrell's rms package in R provides the tools you need to build, calibrate, and validate logistic models. His book linked above and associated class notes provide examples of ways to deal with too many variables. Try them out on your dataset.
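A minimal sketch of that workflow (the formula and data frame are placeholders):

```r
library(rms)

dd <- datadist(dat); options(datadist = "dd")

# x = TRUE, y = TRUE store the design matrix for later resampling
fit <- lrm(y ~ IV1 + IV2 + IV3 + IV4 + IV5, data = dat,
           x = TRUE, y = TRUE)

validate(fit, B = 200)          # optimism-corrected indexes (e.g. Dxy)
plot(calibrate(fit, B = 200))   # bootstrap calibration curve
```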
