Solved – check multicollinearity before regression in R

multicollinearity, r

I want to check for multicollinearity, to avoid redundancy in my data, before fitting a multinomial logistic regression with a categorical dependent variable in R. Most of my variables are dichotomous or ordinal. I do not want to use the VIF method. Is there another method I can use before the regression?

Best Answer

You could transform your categorical variables into one-hot encoded dummy variables before doing what @Adam-Quek suggested:

# demo dummy data
d <- iris
head(d)

      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa

# one-hot encode dummy data
library(caret)
d2 <- data.frame(predict(dummyVars(~., d), d))
head(d2)

      Sepal.Length Sepal.Width Petal.Length Petal.Width Species.setosa Species.versicolor Species.virginica
    1          5.1         3.5          1.4         0.2              1                  0                 0
    2          4.9         3.0          1.4         0.2              1                  0                 0
    3          4.7         3.2          1.3         0.2              1                  0                 0
    4          4.6         3.1          1.5         0.2              1                  0                 0
    5          5.0         3.6          1.4         0.2              1                  0                 0
    6          5.4         3.9          1.7         0.4              1                  0                 0
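
One caveat with full one-hot encoding: the three Species columns sum to 1 in every row, so they are perfectly collinear with each other and with a model intercept. A minimal sketch of reference-level coding follows, assuming base R's model.matrix() in place of caret. The name d_ref is introduced here only for illustration. caret's dummyVars() also accepts fullRank = TRUE for the same purpose.

```r
# Reference-level (k - 1) dummy coding with base R, as a sketch.
d <- iris

# model.matrix() adds an intercept column and drops the first factor
# level (setosa); removing the intercept leaves only the predictors.
d_ref <- as.data.frame(model.matrix(~ ., data = d)[, -1])

# Four numeric columns plus two Species dummies; setosa is the baseline.
names(d_ref)
```

The full one-hot version is still convenient for exploratory plots and correlations, but the reduced version is likely the safer input to the regression itself.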

With the data in this form, you can use the regular tools again (feature correlation, PCA, ...), as Adam suggested:

pairs(d2, upper.panel = NULL)

(Figure: pairs plot)

library(corrplot)
corrplot(cor(d2), type = 'lower')

(Figure: correlation plot)
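
The correlation matrix can also be reduced to a plain list of suspicious pairs, which is easier to scan than a plot when there are many variables. This is a hedged sketch in base R: the 0.8 cutoff is an arbitrary assumption rather than a rule, and the names cm, idx and high_pairs are illustrative. The encoding is redone with model.matrix() so the snippet runs on its own.

```r
# One-hot encode iris with base R (no intercept keeps all Species levels).
d <- iris
d2 <- as.data.frame(model.matrix(~ . - 1, data = d))

# Pairwise Pearson correlations between all columns.
cm <- cor(d2)

# Assumed cutoff for "highly correlated"; adjust to taste.
cutoff <- 0.8

# Positions above the diagonal whose absolute correlation exceeds the cutoff.
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest relationships first.
high_pairs[order(-abs(high_pairs$r)), ]
```

For ordinal predictors coded as integers, cor(d2, method = "spearman") may be a more suitable rank-based alternative to the default Pearson correlation.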

# tol = 0.8 drops components whose standard deviation is <= 0.8 times that of PC1
pcs <- prcomp(d2, center = TRUE, scale. = TRUE, tol = 0.8)
print(pcs)

    Standard deviations:
    [1] 2.086732

    Rotation:
                            PC1
    Sepal.Length        0.4100521
    Sepal.Width        -0.2352425
    Petal.Length        0.4750053
    Petal.Width         0.4647101
    Species.setosa     -0.4508000
    Species.versicolor  0.1027178
    Species.virginica   0.3480821
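
Because tol = 0.8 keeps only the first component, the output above hides the small components, which are the ones that point to redundancy. The sketch below assumes the same one-hot data (rebuilt with model.matrix() so it runs standalone) and uses the illustrative name pcs_full. A standard deviation near zero signals an exact linear dependency, which here comes from the three Species dummies summing to 1.

```r
# One-hot encode iris with base R (no intercept keeps all Species levels).
d <- iris
d2 <- as.data.frame(model.matrix(~ . - 1, data = d))

# Full PCA: without tol, every component is kept.
pcs_full <- prcomp(d2, center = TRUE, scale. = TRUE)

# Standard deviations of all components; values near 0 indicate redundancy.
round(pcs_full$sdev, 4)

# Loadings of the last component show which variables take part
# in the (near-)exact linear dependency.
round(pcs_full$rotation[, ncol(pcs_full$rotation)], 3)
```

Small but non-zero standard deviations are a softer warning: they suggest strongly related predictors rather than an exact dependency, and how small is "too small" is a judgment call.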