Test independence between quantitative and categorical predictors for logistic regression

categorical data, hypothesis testing, independence, logistic, multicollinearity

I have 2 categorical variables with 8 unordered categories and multiple numerical variables and I want to train a logistic regression model. I want to test the independence between all my predictor variables, and remove the ones that are dependent, meaning that they are redundant in my model.

Is there any universal statistical test to test the independence between quantitative and categorical predictors? For two quantitative variables I know I could use correlation tests, and for two categorical, the $\chi^2$ test, but what about a quantitative and a categorical variable?

Best Answer

I have 8 categories and not ordered. In fact I have 2 categorical variables and multiple numerical variables and I want to train a logistic regression model. I want to test the independence between all my variables (only predictors), and remove the ones that are dependent, meaning that they are redundant in my model.

It would have helped you if you'd started with this information.

1) Pairs of variables with highly significant correlations (i.e. very small p-values) may both be needed in a model; they may be very far from "redundant", and a significant pairwise correlation tells you very little about that. With large samples, even trivially small correlations may be highly significant. Hypothesis tests answer the wrong question here (they don't tell you about the impact of the correlation on your inference).

2) If measures of association with categorical variables don't measure the same kind of dependency that matters in your model, they aren't telling you what you need to know.

3) It's quite possible for variables to be pairwise not all that correlated but highly dependent in larger groups; you can check every pairwise correlation and find it's almost zero, yet still have redundant variables across the whole set.
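
To illustrate point 3, here is a minimal simulation sketch. The data are entirely made up: 50 independent predictors plus one more that is their exact sum (the sizes `n` and `k` and all variable names are assumptions for illustration, not anything from the question). Each correlation between the sum and a single predictor is about $1/\sqrt{50} \approx 0.14$ in the population, yet the last predictor is perfectly predicted by the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 2000, 50                      # hypothetical sample size and number of base predictors

Z = rng.standard_normal((n, k))      # k mutually independent predictors
total = Z.sum(axis=1)                # one more predictor: an exact linear combination of them
X = np.column_stack([Z, total])

# Largest absolute pairwise correlation among all k + 1 predictors
corr = np.corrcoef(X, rowvar=False)
off_diag = corr[~np.eye(k + 1, dtype=bool)]
max_abs_pairwise = np.abs(off_diag).max()

# R^2 from regressing the last predictor on all the others (with an intercept)
A = np.column_stack([np.ones(n), Z])
coef, *_ = np.linalg.lstsq(A, total, rcond=None)
resid = total - A @ coef
r_squared = 1.0 - resid.var() / total.var()

print(f"largest |pairwise correlation|: {max_abs_pairwise:.3f}")
print(f"R^2 of last predictor on the rest: {r_squared:.6f}")
```

Screening pairs one at a time would not flag anything alarming here, while the regression on the whole set shows the redundancy immediately.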

Your entire approach is simply misguided - it might help, but it might utterly fail to avoid redundancy and it might get you to throw out important variables for no good reason at all. Your approach tells you less than you might think about redundancy of your variables.

You need to consider your variables as a collection. The right sort of thing to do is check the condition of your $X$-matrix, or something related to it. Sometimes people use things like variance inflation factors, or how completely each $X_j$ is predicted by the collection of previous (or even all other) $X$'s or various other measures. This sort of checking is fairly standard while doing regressions and GLMs.
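
As a rough sketch of what such a check could look like, the code below computes variance inflation factors by regressing each column on all the others, and the condition number of the standardized predictor matrix. It uses only NumPy; the helper `vif`, the simulated data, and the choice to dummy-code an 8-level factor with its first level as baseline are all assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (X has no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
n = 1000
group = rng.integers(0, 8, size=n)        # hypothetical unordered 8-level categorical
dummies = np.eye(8)[group][:, 1:]         # 7 indicator columns, first level as baseline

x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
# x3 is almost completely determined by x1, x2 and the categorical variable
x3 = x1 + x2 + 0.1 * group + 0.05 * rng.standard_normal(n)
x4 = rng.standard_normal(n)               # unrelated to everything else

X = np.column_stack([x1, x2, x3, x4, dummies])
vifs = vif(X)

# Condition number of the centered and scaled predictor matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
cond = np.linalg.cond(Xs)

for name, v in zip(["x1", "x2", "x3", "x4"], vifs[:4]):
    print(f"VIF {name}: {v:.1f}")
print(f"condition number of standardized X: {cond:.1f}")
```

In this made-up setup `x3` gets a very large VIF because the rest of the set predicts it almost perfectly, while `x4` stays near 1. Note that the dummy columns enter the regressions as a block, so the dependence on the categorical variable is picked up without any separate pairwise test.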
