I cannot reproduce this phenomenon exactly, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between the number of categories and multicollinearity.
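Here is an R function to create categorical datasets with a specifiable number of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:

    library(car)  # provides vif()

    trial <- function(n, k1=2, k2=2) {
      # All k1*k2 combinations; columns are named Var1 and Var2 by default.
      # The codes are integers, so lm() treats them as numeric unless
      # they are wrapped in factor().
      df <- expand.grid(1:k1, 1:k2)
      df <- do.call(rbind, lapply(1:n, function(i) df))  # n replicates
      df$y <- rnorm(k1*k2*n)  # response unrelated to the predictors
      fit <- lm(y ~ Var1+Var2, data=df)
      vif(fit)
    }
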
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:

    sapply(1:5, trial)                        # Two binary categories, 1-5 replicates per combination
    sapply(1:5, function(i) trial(i, 10, 3))  # 10 x 3 = 30 combinations, 1-5 replicates

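The orthogonality claim can also be checked directly, without fitting a model. The following is a minimal base-R sketch, assuming both variables are coded as factors with treatment contrasts; the 10 x 3 layout with 2 replicates is only an illustrative choice, and the names `X1`, `X2`, and `cross` are not part of the original code.

```r
# Balanced 10 x 3 design with 2 replicates per combination (illustrative sizes).
df <- expand.grid(Var1 = factor(1:10), Var2 = factor(1:3))
df <- df[rep(seq_len(nrow(df)), times = 2), ]

# Dummy (treatment-contrast) columns for each factor, intercept dropped.
X1 <- model.matrix(~ Var1, df)[, -1]
X2 <- model.matrix(~ Var2, df)[, -1]

# After centering each column, the cross-products between the two blocks
# of dummies vanish (up to floating-point error) when the design is balanced.
cross <- crossprod(scale(X1, scale = FALSE), scale(X2, scale = FALSE))
max(abs(cross))
```

A value indistinguishable from zero means that neither factor's dummies can be predicted from the other's, which is what a VIF of $1$ expresses.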
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line

    df <- subset(df, subset=(y < 0))

before the `fit` line in `trial`. This removes half the data at random. Re-running

    sapply(1:5, function(i) trial(i, 10, 3))

shows that the VIFs are no longer equal to $1$ (but they remain close to it, randomly). They still do not increase with more categories: `sapply(1:5, function(i) trial(i, 10, 10))` produces comparable values.
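To see how imbalance alone can drive the VIF, here is a small deterministic sketch in base R. It assumes the simplest case of two binary predictors, for which the VIF of either one is $1/(1-r^2)$ with $r$ the correlation between them. The helper `vif_two_binary` and the cell counts are hypothetical, chosen only for illustration.

```r
# VIF for two binary predictors in a 2 x 2 layout whose cell counts are
# a for cells (0,0) and (1,1), and b for cells (0,1) and (1,0).
# a == b is the balanced case.
vif_two_binary <- function(a, b) {
  x1 <- rep(c(0, 0, 1, 1), times = c(a, b, b, a))
  x2 <- rep(c(0, 1, 0, 1), times = c(a, b, b, a))
  1 / (1 - cor(x1, x2)^2)  # two-predictor identity for the VIF
}

# Keep the total at 20 observations while making the design more lopsided.
sapply(5:9, function(a) vif_two_binary(a, 10 - a))
```

The first value (the balanced case, `a = b = 5`) is $1$, and the values grow as the cell counts diverge, even though the number of categories never changes.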
Best Answer
No, a "standard multiple regression" is not appropriate here, assuming by this you mean a regression with a single continuous variable as the response. Regression of this sort only makes sense if the different levels of the response can be seen as different values of a continuous variable. There is no way this can be the case with Subject. Even if it were, you would have a lot of problems meeting the usual assumptions in fitting such a model.
I don't know how your stats package let you do this. It probably converted Subject into a continuous variable based on its internal coding (e.g. 1=Maths, 2=Business, etc., probably in alphabetical order), but it will certainly have given meaningless results.
A chi-square test would tell you if there is a relationship between the variables, but if you want to understand whether Big 5 and Learning style are related specifically to Subject, you will probably be best off with a multinomial regression.
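As a rough sketch of both suggestions in R, assuming the `nnet` package (distributed with R as a recommended package) is available: the data below are simulated placeholders, and the column names `Subject`, `Extraversion`, and `LearningStyle` are hypothetical stand-ins for your own variables.

```r
library(nnet)  # provides multinom()

set.seed(1)
# Simulated placeholder data; every name and value here is hypothetical.
d <- data.frame(
  Subject       = factor(sample(c("Business", "Maths", "Psychology"), 150, replace = TRUE)),
  Extraversion  = rnorm(150),
  LearningStyle = factor(sample(c("Verbal", "Visual"), 150, replace = TRUE))
)

# Chi-square test of association between two categorical variables.
chisq.test(table(d$Subject, d$LearningStyle))

# Multinomial logistic regression with Subject as the categorical response.
fit <- multinom(Subject ~ Extraversion + LearningStyle, data = d, trace = FALSE)
summary(fit)
```

Each row of `coef(fit)` gives the log-odds of one subject relative to the reference level (here the first level alphabetically), so the predictors are allowed to relate differently to each subject.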