I have the following problem: I'm performing a multiple logistic regression on several variables, each of which has a nominal scale. I want to avoid multicollinearity in my regression. If the variables were continuous I could compute the variance inflation factor (VIF) and look for variables with a high VIF. If the variables were ordinally scaled I could compute Spearman's rank correlation coefficients for several pairs of variables and compare the computed value with a certain threshold. But what do I do if the variables are just nominally scaled? One idea would be to perform a pairwise chi-square test for independence, but the different variables don't all have the same co-domains, so that would be another problem. Is there a way to solve this problem?
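For concreteness, here is how I would run one such pairwise test in R on two made-up nominal variables (the names and data are invented purely for illustration):

set.seed(0)
color  <- factor(sample(c("red", "green", "blue"), 200, replace = TRUE))
answer <- factor(sample(c("yes", "no"), 200, replace = TRUE))
chisq.test(table(color, answer))   # chi-square test of independence for this single pair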
Solved – How to avoid collinearity of categorical variables in logistic regression
logistic, multicollinearity, multiple regression, regression
Related Solutions
You seem to include the interaction term ub:lb, but not ub and lb themselves as separate predictors. This would violate the so-called "principle of marginality", which states that a model containing a higher-order term should also contain the corresponding lower-order terms (Wikipedia for a start). Effectively, you are now including a predictor that is just the element-wise product of ub and lb.
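A minimal sketch of the distinction, with ub and lb simulated here purely for illustration:

set.seed(1)
ub <- rnorm(100)
lb <- rnorm(100)
y  <- ub + lb + rnorm(100)

fit_bad  <- lm(y ~ ub:lb)    # interaction only: just the element-wise product, violating marginality
fit_good <- lm(y ~ ub * lb)  # shorthand for ub + lb + ub:lb, respecting marginality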
$VIF_{j}$ is just $\frac{1}{1-R_{j}^{2}}$, where $R_{j}^{2}$ is the $R^{2}$ value when you run a regression with your original predictor variable $j$ as criterion, predicted by all remaining predictors (it is also the $j$-th diagonal element of $R_{x}^{-1}$, the inverse of the correlation matrix of the predictors). A VIF value of 50 thus indicates that you get an $R^{2}$ of .98 when predicting ub with the other predictors, indicating that ub is almost completely redundant (same for lb, $R^{2}$ of .97).
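To make that relation concrete, here is a small sketch (with simulated predictors) showing that both routes give the same number:

set.seed(2)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)       # correlated with x1
x3 <- rnorm(100)

r2 <- summary(lm(x1 ~ x2 + x3))$r.squared   # R_j^2 for predictor x1
1 / (1 - r2)                                # VIF of x1 from the definition
solve(cor(cbind(x1, x2, x3)))[1, 1]         # same value: 1st diagonal element of the inverse correlation matrix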
I would start by computing all pairwise correlations between predictors and running the aforementioned regressions to see which variables predict ub and lb, to check whether the redundancy is easily explained. If so, you can remove the redundant predictors. You can also look into ridge regression (lm.ridge() from package MASS in R).
More advanced multicollinearity diagnostics use the eigenvalue structure of $X^{t}X$, where $X$ is the design matrix of the regression (i.e., all predictors as column vectors). The condition number $\kappa$ is $\frac{\sqrt{\lambda_{max}}}{\sqrt{\lambda_{min}}}$, where $\lambda_{max}$ and $\lambda_{min}$ are the largest and smallest ($\neq 0$) eigenvalues of $X^{t}X$. In R, you can use kappa(lm(<formula>)), where the lm() model typically uses the standardized variables.
Geometrically, $\kappa$ gives you an idea about the shape of the data cloud formed by the predictors. With 2 predictors, the scatterplot might look like an ellipse with 2 main axes. $\kappa$ then tells you how "flat" that ellipse is, i.e., it is a measure of the ratio of the length of the largest main axis to the length of the smallest main axis. With 3 predictors, you might have a cigar shape, and 3 main axes. The "flatter" your data cloud is in some direction, the more redundant the variables are when taken together.
There are some rules of thumb for uncritical values of $\kappa$ (I have heard less than 20). But be advised that $\kappa$ is not invariant under data transformations that just change the unit of the variables, like standardizing. This is unlike VIF: vif(lm(y ~ x1 + x2)) will give you the same result as vif(lm(scale(y) ~ scale(x1) + scale(x2))) (as long as there are no multiplicative terms in the model), but kappa(lm(y ~ x1 + x2)) and kappa(lm(scale(y) ~ scale(x1) + scale(x2))) will almost surely differ.
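A quick illustration of that contrast, with simulated data (vif() again assumed to come from the car package):

library(car)
set.seed(5)
x1 <- rnorm(200)
x2 <- 0.5 * x1 + rnorm(200)
y  <- x1 + x2 + rnorm(200)

vif(lm(y ~ x1 + x2))
vif(lm(scale(y) ~ scale(x1) + scale(x2)))    # identical VIFs

kappa(lm(y ~ x1 + x2))
kappa(lm(scale(y) ~ scale(x1) + scale(x2)))  # (almost surely) different condition numbers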
I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between numbers of categories and multicollinearity.
Here is an R function to create categorical datasets with specifiable numbers of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)   # vif() is assumed to come from the car package

trial <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)                      # all k1 x k2 combinations (columns Var1, Var2)
  df <- do.call(rbind, lapply(1:n, function(i) df))  # replicate each combination n times: a balanced design
  df$y <- rnorm(k1*k2*n)                             # independent standard-normal response
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 30 categories, 1-5 replicates
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line
df <- subset(df, subset=(y < 0))
before the fit line in trial. This removes half the data at random. Re-running
sapply(1:5, function(i) trial(i, 10, 3))
shows that the VIFs are no longer equal to $1$ (but they remain close to it, randomly). They still do not increase with more categories: sapply(1:5, function(i) trial(i, 10, 10)) produces comparable values.
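For reference, the modified function with that line inserted looks like this (renamed here so the original trial stays intact):

trial_unbalanced <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  df <- subset(df, subset=(y < 0))   # drops roughly half the rows, unbalancing the design
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}

sapply(1:5, function(i) trial_unbalanced(i, 10, 3))   # VIFs near, but no longer exactly, 1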
Related Question
- Solved – How to evaluate collinearity or correlation of predictors in logistic regression
- Solved – How to calculate the variance inflation factor for a categorical predictor variable when examining multicollinearity in a linear regression model
- Solved – Lasso Regression for predicting Continuous Variable + Variable Selection
Best Answer
I would second @EdM's comment (+1) and suggest using a regularised regression approach.
I think that an elastic-net/ridge regression approach should allow you to deal with collinear predictors. Just be careful to normalise your feature matrix $X$ appropriately before using it; otherwise you risk regularising each feature disproportionately (yes, I mean the $0/1$ dummy columns: scale them so that each column has unit variance and mean $0$).
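A minimal sketch of that workflow, here using glmnet (just one possible implementation; the variable names and data below are invented):

library(glmnet)
set.seed(6)
n <- 500
d <- data.frame(a = factor(sample(letters[1:4], n, replace = TRUE)),
                b = factor(sample(letters[1:3], n, replace = TRUE)),
                c = factor(sample(letters[1:5], n, replace = TRUE)))
y <- rbinom(n, 1, 0.5)

X  <- model.matrix(~ a + b + c, data = d)[, -1]   # 0/1 dummy columns, intercept dropped
Xs <- scale(X)                                    # each column: mean 0, unit variance

cvfit <- cv.glmnet(Xs, y, family = "binomial", alpha = 0.5)   # alpha = 0 gives pure ridge
coef(cvfit, s = "lambda.min")                                 # coefficients at the CV-chosen penalty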
Clearly you would have to cross-validate your results to ensure some notion of stability. Let me also note that instability is not a huge problem here, because it actually suggests that there is no obvious solution/inferential result, and that simply interpreting the GLM procedure as "ground truth" is incoherent.