Solved – How to evaluate collinearity or correlation of predictors in logistic regression

Tags: dimensionality-reduction, lasso, logistic, multicollinearity, regression-strategies

In linear regression, multicollinearity can render individual predictors non-significant, as discussed in this question:
How can a regression be significant yet all predictors be non-significant?

If this is the case, the amount of multicollinearity can be evaluated through, for example, the variance inflation factor (VIF).
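For concreteness, here is a minimal sketch of that diagnostic in R, assuming the car package and simulated data (neither of which is part of the question itself):

```r
# Hypothetical example: fit a linear model and inspect VIFs with car::vif()
library(car)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # deliberately collinear with x1
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)
vif(fit)   # large VIFs (rules of thumb: > 5 or > 10) flag problematic collinearity
```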

In logistic regression this approach is not possible, as far as I understand. Nevertheless, it is very common to do stepwise reduction of the variable space based on significance, or to use L1 regularization to reduce the number of predictors and avoid overfitting.
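As an illustration of the L1 approach mentioned above, a minimal sketch using the glmnet package (my choice of tool, not specified in the question) on simulated data might look like this:

```r
# Hypothetical sketch: L1-penalized (lasso) logistic regression with glmnet
library(glmnet)

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)      # highly correlated with x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)
p  <- plogis(-0.5 + 1.5 * x1 + 0.5 * x3)
y  <- rbinom(n, 1, p)

# Cross-validated choice of the penalty; family = "binomial" gives the logistic loss
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.1se")      # note which of the correlated pair x1/x2 is shrunk to zero
```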

In this case, isn't it possible that you fool yourself and remove variables that might have been significant, or that would have had larger beta values, simply because you included several collinear or highly correlated variables in your variable set? Even if you do this properly using cross-validation or bootstrap validation, it still intuitively feels like this could happen, especially in areas where you don't have all the variables beforehand but rather need to construct them yourself, as is common in much of data science today where a lot of data is available.

Is there any way to avoid this effect or at least evaluate the collinearity of the predictors?

Best Answer

Variable selection based on "significance", AIC, BIC, or Cp is not a valid approach in this context. Lasso (L1) shrinkage works, but you may be disappointed in the stability of the list of "important" predictors it finds.
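To see that instability concretely, one rough sketch (again assuming the glmnet package and simulated data, neither of which is named in the answer) is to bootstrap the data and record which predictors the lasso keeps each time:

```r
# Hypothetical sketch: selection frequency of lasso-chosen predictors over bootstrap resamples
library(glmnet)

set.seed(2)
n  <- 500
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.2); x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)
y  <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1 + 0.5 * x3))

sel <- replicate(30, {
  i   <- sample(n, replace = TRUE)                       # bootstrap resample
  fit <- cv.glmnet(X[i, ], y[i], family = "binomial", alpha = 1)
  b   <- as.vector(coef(fit, s = "lambda.1se"))[-1]      # drop the intercept
  as.numeric(b != 0)                                     # 1 = selected, 0 = dropped
})
rowMeans(sel)   # per-predictor selection frequency; values far from 0 or 1 signal instability
```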

The simplest approach to understanding collinearity is variable clustering and redundancy analysis (e.g., the varclus and redun functions in the R Hmisc package). This approach is not tailored to the actual model you use: logistic regression is based on weighted $X'X$ calculations rather than the ordinary $X'X$ used in variable clustering and redundancy analysis, but the results will be close. To tailor the collinearity assessment to the chosen outcome model, you can compute the correlation matrix of the maximum likelihood estimates of $\beta$ and even use that matrix as a similarity matrix in a hierarchical cluster analysis, not unlike what varclus does.
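A minimal sketch of these ideas on simulated data follows; varclus and redun are the Hmisc functions named above, and the last two lines show the correlation matrix of the $\hat\beta$ from an ordinary glm fit of the outcome model:

```r
# Hypothetical sketch: variable clustering, redundancy analysis, and the
# correlation matrix of the ML estimates from the fitted logistic model
library(Hmisc)

set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n))
d$x2 <- d$x1 + rnorm(n, sd = 0.3)            # nearly redundant with x1
d$x3 <- rnorm(n)
d$y  <- rbinom(n, 1, plogis(d$x1 - d$x3))

vc <- varclus(~ x1 + x2 + x3, data = d)      # hierarchical clustering of the predictors
plot(vc)
redun(~ x1 + x2 + x3, data = d, r2 = 0.9)    # flags predictors predictable from the others

# Tailored to the fitted outcome model: correlation matrix of the beta-hat estimates
fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = d)
round(cov2cor(vcov(fit)), 2)
```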

Various data reduction procedures, the oldest being incomplete principal components regression, can avoid collinearity problems at some expense in interpretability. In general, data reduction performs better than any stepwise variable selection algorithm because of the direct way in which data reduction handles collinearity.
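As a rough sketch of incomplete principal components regression in the logistic setting (simulated data, with base-R prcomp and glm as my own choices, not prescribed by the answer), one fits the model on only the first few components:

```r
# Hypothetical sketch: replace the raw predictors with their first k principal components
set.seed(1)
n  <- 300
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n); x4 <- x3 + rnorm(n, sd = 0.3)
X  <- cbind(x1, x2, x3, x4)
y  <- rbinom(n, 1, plogis(x1 - x3))

pc  <- prcomp(X, scale. = TRUE)          # principal components of the predictors
k   <- 2                                 # keep only the first k components ("incomplete")
fit <- glm(y ~ pc$x[, 1:k], family = binomial)
summary(fit)
```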

You can get VIFs in logistic regression. See, for example, the vif function, which can be applied to lrm fits in the R rms package.
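A minimal sketch, again on simulated data:

```r
# Hypothetical sketch: VIFs from a logistic model fit with rms::lrm
library(rms)

set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n))
d$x2 <- d$x1 + rnorm(n, sd = 0.3)            # collinear with x1
d$x3 <- rnorm(n)
d$y  <- rbinom(n, 1, plogis(d$x1 - d$x3))

fit <- lrm(y ~ x1 + x2 + x3, data = d)
vif(fit)                                     # VIFs computed from the model's information matrix
```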