Logistic Regression – Treating Strongly Correlated Covariates

Tags: logistic, multicollinearity, multiple regression

I have to build a multiple logistic regression model with two strongly correlated covariates (predictor variables). How should they be treated? Should I exclude one of them from the regression?

There is also a covariate that is a logical (Boolean) variable and is TRUE for only 2 of the 107 observations. Should I exclude this covariate from the regression, or does it make sense to keep it?

I have 6 covariates and 107 observations. All covariates are Boolean, but I may replace some of them with the underlying numerical variables.

Thank you very much for your answers. I am a mathematician but a novice in applied statistics, so I need detailed answers.

I need to estimate the influence of each covariate on the outcome, so I need to calculate odds ratios or something similar.

Best Answer

There are at least three regularization strategies that address both problems here: the multicollinearity between your two correlated covariates, and the (quasi-)separation risk created by a Boolean covariate that is TRUE for only 2 of 107 observations.

1) Build a Bayesian regression model with a prior distribution on the regression coefficients that shrinks the estimates toward zero but leaves enough prior mass for the posterior to follow a signal in the data if it is strong enough. There are several priors of this kind, including but not limited to the Laplace, spike-and-slab, and horseshoe priors. Gelman et al. describe a default prior distribution for the coefficients of logistic regression, which pairs well with the bayesglm function they developed in the arm package in R; it lets you easily build and summarize logistic and other generalized linear models. The paper is Gelman, Jakulin, Pittau, and Su (2008), "A weakly informative default prior distribution for logistic and other regression models", The Annals of Applied Statistics.
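A minimal sketch of this approach, assuming your data sit in a data frame called dat with a Boolean outcome y and Boolean covariates x1 through x6 (all of these names are placeholders for your own):

```r
## Weakly informative Bayesian logistic regression with arm::bayesglm.
## `dat`, `y`, and `x1`..`x6` are placeholder names; adapt the formula to your data.
library(arm)

fit <- bayesglm(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                family      = binomial(link = "logit"),
                data        = dat,
                prior.scale = 2.5,  # Gelman et al.'s default Cauchy(0, 2.5) prior on scaled inputs
                prior.df    = 1)    # df = 1 makes the Student-t prior a Cauchy

display(fit)    # coefficients and standard errors on the log-odds scale
exp(coef(fit))  # point estimates of the odds ratios

## Approximate posterior draws of the coefficients, giving odds-ratio intervals directly.
sims <- sim(fit, n.sims = 2000)
t(apply(exp(coef(sims)), 2, quantile, probs = c(0.025, 0.5, 0.975)))
```

Because the prior shrinks every coefficient, including those of the two correlated covariates and of the covariate that is TRUE for only 2 observations, you can keep all six covariates in the model.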

2) Penalized regression with the L1 norm (LASSO regression), the L2 norm (ridge regression), or a combination of the two (the elastic net). Tibshirani, Hastie, and colleagues have developed an R package called glmnet, which implements elastic-net regression (and thus the LASSO and ridge regression, since both are special cases of the elastic net). The package includes the logit model. There is an excellent vignette for the package, at the end of which you will find useful references on regularization in general and on the ridge/LASSO/elastic-net framework in particular. If you want a video version of the vignette (and to learn a lot of other things along the way), I recommend taking their Stanford online course.
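A minimal sketch with glmnet, under the same placeholder names dat, y, and x1 through x6; alpha controls the mix between the L1 and L2 penalties:

```r
## Elastic-net logistic regression with glmnet (alpha = 1 is the LASSO, alpha = 0 is ridge).
## `dat`, `y`, and `x1`..`x6` are placeholder names.
library(glmnet)

X <- model.matrix(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)[, -1]  # drop the intercept column
y <- as.numeric(dat$y)                                                # 0/1 outcome

## Cross-validate the penalty strength lambda at a fixed mixing parameter alpha.
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5, nfolds = 10)

plot(cvfit)                                    # cross-validated deviance as a function of lambda
coef(cvfit, s = "lambda.1se")                  # shrunken log-odds coefficients (sparse matrix)
exp(as.matrix(coef(cvfit, s = "lambda.1se")))  # rough odds-ratio scale for the surviving coefficients
```

Keep in mind that penalized point estimates are deliberately biased toward zero, so the implied odds ratios are conservative; if you want full uncertainty statements about each odds ratio, the Bayesian route in option 1 is the more natural fit.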

3) Another way to deal with multicollinearity in logistic and other generalized linear models is boosted regression. A boosted regression model iteratively aggregates the fits of many simple models called "base learners". Because each base learner typically involves only one or a few covariates, you avoid fitting everything jointly in a high-dimensional space, and you can also compute variable-importance measures. If you set up your base learners properly, multicollinearity is no longer an issue. There is a great package in R called mboost, which implements boosted generalized linear models and multilevel generalized linear models. Another reason mboost is great is the variety of base learners available, including non-parametric smoothing splines and random fields. Amazing stuff. Even better is a related package called gamboostLSS, which lets you build boosted regression models for each parameter of your likelihood, not just the mean or some other location parameter.
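A minimal sketch with mboost's glmboost, again with placeholder names; note that mboost's Binomial() family expects the response to be a two-level factor:

```r
## Component-wise boosted logistic regression with mboost::glmboost.
## `dat`, `y`, and `x1`..`x6` are placeholder names; y must be a two-level factor here.
library(mboost)

fit <- glmboost(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                data    = dat,
                family  = Binomial(),                 # logistic loss
                control = boost_control(mstop = 500)) # generous upper bound on iterations

## Choose the number of boosting iterations by resampling to avoid overfitting.
cvm <- cvrisk(fit)       # 25 bootstrap replicates by default
fit <- fit[mstop(cvm)]   # truncate the model at the optimal stopping iteration

coef(fit)  # coefficients of the selected base learners; mboost's Binomial() works on the
           # half log-odds scale, so multiply by about 2 to compare with glm's logit coefficients
```

Here the regularization comes from early stopping: the fewer boosting iterations you allow, the more strongly the coefficients are shrunken toward zero, and base learners that are never selected drop out of the model entirely.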

In your situation, I'd say the best of these methods is either the Gelman et al. recipe or the elastic net. Of the two, I'd prefer the Gelman et al. recipe, because it yields not only point estimates but full posterior distributions for the coefficients, from which you can read off odds ratios and uncertainty intervals directly.

Side note: the beauty of the elastic net and the boosting methods, and of the fully Bayesian approach under some priors, is that by regularizing your model you can also build models with lots of features, even models with more features than observations. The regularization procedure in some sense selects the features that are most important while avoiding, or at least mitigating, the curse of dimensionality.
