If you are using R, SPSS or Stata, you can look at the perturb package. It diagnoses collinearity by adding random noise to continuous variables; categorical variables are randomly reassigned to different categories. The R documentation for perturb notes that the model need not be lm, implying that any model (including ones built with optimal scaling or ordinal logistic regression) could be used.
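A minimal sketch of a perturb run (the pvars/prange arguments follow my reading of the package documentation; check ?perturb before relying on them):

library(perturb)
# fit any model -- the documentation's point is that it need not be lm
m <- lm(mpg ~ disp + hp + wt, data = mtcars)
# refit repeatedly, adding random noise of magnitude 1 to each continuous predictor
p <- perturb(m, pvars = c("disp", "hp", "wt"), prange = c(1, 1, 1))
summary(p)   # summarizes how much the coefficients move across the refits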
I'm glad you like my answer :-)
It's not that there is no valid method of detecting collinearity in logistic regression: since collinearity is a relationship among the independent variables, the dependent variable doesn't matter.
What is problematic is figuring out how much collinearity is too much for logistic regression. David Belsley did extensive work with condition indexes. He found that indexes over 30, with substantial variance accounted for in more than one variable, were indicative of collinearity that would cause severe problems in OLS regression. However, "severe" is always a judgment call. Perhaps the easiest way to see the problems of collinearity is to show that small changes in the data make big changes in the results.
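As a toy illustration of the diagnostic (my own sketch, not Belsley's worked example): scale the columns of the model matrix to unit length and take the ratio of the largest singular value to each singular value; near-duplicate predictors push the largest index well past 30.

set.seed(1)
a <- rnorm(50)
b <- a + rnorm(50, 0, .01)        # b nearly duplicates a
X <- cbind(1, a, b)               # model matrix, intercept included
Xs <- apply(X, 2, function(v) v / sqrt(sum(v^2)))   # unit-length columns
d <- svd(Xs)$d
d[1] / d                          # condition indexes; the largest is far above 30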
[This paper](http://www.medicine.mcgill.ca/epidemiology/joseph/courses/epib-621/logconfound.pdf) offers examples of collinearity in logistic regression. It even shows that R detects exact collinearity and that, in fact, some cases of approximate collinearity will cause the same warning:
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
Nevertheless, we can ignore this warning and run:
set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, 0, 1)   # x3 is approximately collinear with x1 and x2
y <- x1 + 2*x2 + 3*x3 + rnorm(100)
ylog <- cut(y, 2, c(1, 0))         # dichotomize y into a binary factor
m1 <- glm(ylog ~ x1 + x2 + x3, family = binomial)
coef(m1)
which yields coefficients of -2.55, 1.97, 5.60 and 12.54 (intercept first).
We can then slightly perturb x1 and x2, add them for a new x3 and run again:
x1a <- x1 + rnorm(100, 0, .01)   # x1 with a tiny amount of added noise
x2a <- x2 + rnorm(100, 0, .01)   # likewise for x2
x3a <- x1a + x2a + rnorm(100, 0, 1)
ya <- x1a + 2*x2a + 3*x3a + rnorm(100)   # ya and yloga are constructed but not
yloga <- cut(ya, 2, c(1, 0))             # used below: the original response ylog
                                         # is kept so only the predictors change
m2 <- glm(ylog ~ x1a + x2a + x3a, family = binomial)
coef(m2)
This yields wildly different coefficients: 0.003, 3.012, 3.51 and -0.41.
And yet this set of independent variables does not have a high condition index:
library(perturb)
colldiag(m1)
says the maximum condition index is 3.54.
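If you prefer variance inflation factors, a quick cross-check is possible (assuming the car package is installed; I have not verified the exact values it returns for this model):

library(car)
vif(m1)   # compare against the usual rule-of-thumb threshold of 10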
I am not aware of any Monte Carlo studies of this; if there are none, it seems a good area for research.
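One way such a study might start (a sketch of mine building on the simulation above; the number of replications and the perturbation size are arbitrary choices):

B <- 500
coefs <- replicate(B, {
  x1b <- x1 + rnorm(100, 0, .01)          # perturb the predictors as before
  x2b <- x2 + rnorm(100, 0, .01)
  x3b <- x1b + x2b + rnorm(100, 0, 1)
  coef(glm(ylog ~ x1b + x2b + x3b, family = binomial))
})
apply(coefs, 1, sd)   # large spreads flag instability the condition index missed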
You could transform your categorical variables into one-hot encoded dummy variables before doing what @Adam-Quek suggested; once everything is numeric, the regular tools (feature correlation, PCA, ...) apply again:
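A minimal sketch in base R (the data frame and variable names here are made up for illustration):

df <- data.frame(x = rnorm(20),
                 g = factor(sample(c("a", "b", "c"), 20, replace = TRUE)))
X <- model.matrix(~ x + g, data = df)[, -1]   # 0/1 dummy columns, intercept dropped
cor(X)                      # feature correlations, now defined for every column
prcomp(X, scale. = TRUE)    # PCA on the one-hot encoded design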