If you are using R, SPSS or Stata, you can look at the perturb package. It diagnoses collinearity by adding random noise to continuous variables; for categorical variables, some observations are randomly reassigned to different categories. The documentation for perturb in R notes that the model need not be an lm fit, implying that any model (including ones built with optimal scaling or ordinal logistic regression) could be used.
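As a rough illustration of how a call might look, here is a minimal sketch modeled on the package's documentation example (the Duncan data come from the car package; check ?perturb for the full set of arguments, including pfac for reclassifying categorical variables):

library(car)      # supplies the Duncan example data (and vif)
library(perturb)

# The documented example attaches the data before fitting
attach(Duncan)
m <- lm(prestige ~ income + education)

# Refit the model many times, each time adding random noise of magnitude 1
# to income and education; unstable coefficients across refits signal collinearity
p <- perturb(m, pvars = c("income", "education"), prange = c(1, 1))
summary(p)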
I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between numbers of categories and multicollinearity.
Here is an R function to create categorical datasets with specifiable numbers of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)  # vif() lives in the car package

trial <- function(n, k1=2, k2=2) {
  # All k1*k2 combinations of the two categorical variables (Var1, Var2)
  df <- expand.grid(1:k1, 1:k2)
  # Replicate every combination n times: a balanced design
  df <- do.call(rbind, lapply(1:n, function(i) df))
  # Response generated independently of the predictors
  df$y <- rnorm(k1*k2*n)
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 categories (30 combinations), 1-5 replicates
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line
df <- subset(df, subset=(y < 0))
before the fit line in trial. This removes roughly half the data at random. Re-running
sapply(1:5, function(i) trial(i, 10, 3))
shows that the VIFs are no longer exactly equal to $1$ (but they remain close to it, varying randomly). They still do not increase with more categories: sapply(1:5, function(i) trial(i, 10, 10)) produces comparable values.
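For reference, the modified function looks like this (identical to the version above except for the added subset line):

trial <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  # Discard roughly half the rows at random, destroying the balance
  df <- subset(df, subset=(y < 0))
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}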
Best Answer
The most important assumptions to check are those for any multiple regression, as explained for example in Faraway's "Practical Regression and Anova using R," Chapter 7: tests for outliers and influential observations, a plot of residuals versus fitted values (an extremely useful scatter plot that incorporates both the categorical and the continuous predictor), tests of non-linearity and distributions of residuals, and so forth.
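As a rough sketch of those checks in R (the data frame dat and the names y, x, and g are hypothetical stand-ins for the actual response, continuous predictor, and 3-level categorical predictor):

# Hypothetical data: x is continuous, g is a 3-level factor
set.seed(1)
dat <- data.frame(x = rnorm(90), g = factor(rep(c("a", "b", "c"), each = 30)))
dat$y <- 1 + 0.5 * dat$x + as.numeric(dat$g) + rnorm(90)

fit <- lm(y ~ x + g, data = dat)

plot(fitted(fit), resid(fit))  # residuals versus fitted values
plot(fit, which = 2)           # normal Q-Q plot of the residuals
plot(fit, which = 5)           # residuals vs leverage, flags influential observations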
"Multicollinearity" would seem to be a bit of an overstatement with only 2 predictor variables. If you are concerned about collinearity, you could for example see how the values of the continuous predictor are distributed among the 3 levels of the categorical predictor. The Faraway reference noted above discusses collinearity in Chapter 9. As the answer from @jur notes, its practical importance depends on the intended use of the model.