Solved – Collinearity in R for dataset with 40+ variables

large data, matrix, modeling, multicollinearity, r

I have a large data matrix with 6000 rows (observations) and 45 columns: 44 predictor variables (categorical and continuous) and 1 binary response variable (0 or 1). I want to check for correlation/multicollinearity among the predictors in R. I have looked into cor() and a heat map so far, but it seems that for data of this size I need to use something else. Please advise.

Best Answer

I also like VIFs (variance inflation factors), but another approach is to estimate the mutual information between or among the predictors, since it is not limited to linear relationships. The idea is to keep only those covariates that share little mutual information with the others, as they are telling you something different. Check out the infotheo or entropy packages.
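To make both checks concrete, here is a minimal base-R sketch on simulated data. Everything in it is an assumption made for illustration: the variables x1 to x4 and g, the helper name mi, the choice of 10 equal-width bins, and the hand-rolled plug-in estimator. It is not the infotheo or entropy implementation, which would normally be the better choice on real data; I believe infotheo offers discretize() and mutinformation() for this, but check its documentation.

```r
# Simulated stand-in for the real data (hypothetical variables, 6000 rows)
set.seed(1)
n  <- 6000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)        # almost a linear copy of x1
x3 <- rnorm(n)
x4 <- x3^2 + rnorm(n, sd = 0.1)      # depends on x3, but not linearly
g  <- factor(ifelse(x1 > 0, "hi", "lo"))  # categorical variable derived from x1
X  <- data.frame(x1, x2, x3, x4)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the
# remaining numeric predictors. Only linear dependence is picked up here.
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))

# Plug-in estimate of mutual information (in nats) between two variables.
# Numeric inputs are cut into equal-width bins; factors are used as they are.
mi <- function(a, b, bins = 10) {
  bin  <- function(v) if (is.numeric(v)) cut(v, breaks = bins) else factor(v)
  p_ab <- table(bin(a), bin(b)) / length(a)   # joint distribution
  p_a  <- rowSums(p_ab)                       # marginal of a
  p_b  <- colSums(p_ab)                       # marginal of b
  nz   <- p_ab > 0                            # skip empty cells (0 * log 0 = 0)
  sum(p_ab[nz] * log(p_ab[nz] / outer(p_a, p_b)[nz]))
}

# Linear correlation misses the x3/x4 dependence; mutual information does not
print(cor(X$x3, X$x4))
print(mi(X$x3, X$x4))
print(mi(X$x1, X$x3))   # independent pair, for comparison
print(mi(g, X$x1))      # categorical vs. continuous
```

In this toy setup the VIFs flag the x1/x2 pair, while the x3/x4 dependence only shows up in the mutual information. With real data the bin count matters and the plug-in estimate is biased upward for small samples, so treat the numbers as a ranking aid rather than exact values. Categorical predictors also need care on the VIF side (dummy coding, or the generalized VIF that car::vif reports for multi-level factors).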