Solved – How to analyse three categorical variables

categorical data, contingency tables, dataset, log-linear

I need some help identifying a test to use for three categorical variables: Subject (maths, business etc), Big 5, and Learning style. I am carrying out research on whether there is a relationship among the above three variables. There are no scores, only categories.

I looked at the chi-squared test, but it doesn't appear to be helpful if I have more than two variables. Participants are all one gender.

Best Answer

No, a "standard multiple regression" is not appropriate here, assuming by this you mean a regression with a single continuous variable as the response. Regression of this sort only makes sense if the different levels of the response can be seen as different values of a continuous variable. There is no way this can be the case with Subject. Even if it were, you would have a lot of problems with the usual assumptions in fitting such a model.

I don't know how your stats package let you do this - probably it converted Subject into a continuous variable based on its internal coding (e.g. 1 = Maths, 2 = Business, etc., probably in alphabetical order) - but it will certainly have given meaningless results.
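As a small illustration of what such an internal coding can look like, here is a sketch in R. The subject labels are made up, and other packages may code categories differently; the point is only that the integers reflect label order, not any quantity.

```r
# Hypothetical subject labels, for illustration only
subject <- factor(c("maths", "business", "art", "maths", "art"))

# By default R sorts factor levels alphabetically
levels(subject)

# The internal integer codes follow that ordering, so treating them as
# numbers would imply art < business < maths on some continuous scale
codes <- as.integer(subject)
codes
```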

A chi-square test would tell you if there is a relationship between the variables, but if you want to understand whether Big 5 and Learning style are related specifically to Subject, you will probably be best off with a multinomial regression.
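Both suggestions can be sketched in R roughly as follows. This is only an illustration under assumptions: the data are simulated, the column and level names are hypothetical, and the model uses `multinom()` from the `nnet` package, which is one of several ways to fit a multinomial regression.

```r
library(nnet)  # multinom(); a recommended package bundled with R

set.seed(1)
n <- 300
# Simulated stand-in data; all names and levels are hypothetical
d <- data.frame(
  subject = factor(sample(c("maths", "business", "arts"), n, replace = TRUE)),
  big5    = factor(sample(c("open", "conscientious", "extravert",
                            "agreeable", "neurotic"), n, replace = TRUE)),
  style   = factor(sample(c("visual", "auditory", "kinaesthetic"),
                          n, replace = TRUE))
)

# Chi-squared test of mutual independence of all three factors at once
indep <- summary(xtabs(~ subject + big5 + style, data = d))
indep$p.value

# Multinomial regression with Subject as the response
fit_full <- multinom(subject ~ big5 + style, data = d, trace = FALSE)
fit_null <- multinom(subject ~ 1, data = d, trace = FALSE)

# Likelihood-ratio test: do Big 5 and Learning style jointly predict Subject?
lr <- deviance(fit_null) - deviance(fit_full)
df <- fit_full$edf - fit_null$edf
p_value <- pchisq(lr, df = df, lower.tail = FALSE)
p_value
```

Because the simulated variables are generated independently, neither test is expected to find a relationship here; with real data the same calls would be run on the observed categories.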

Related Solutions

Solved – Is multicollinearity implicit in categorical variables

I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.

The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between numbers of categories and multicollinearity.
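To make the orthogonality claim concrete, here is a minimal sketch (the 3 x 4 design and the names `a` and `b` are arbitrary choices for illustration). In a balanced design, the centred dummy columns of one factor have zero cross-product with those of the other.

```r
# Balanced 3 x 4 design with 2 replicates per cell (sizes chosen arbitrarily)
df <- expand.grid(a = factor(1:3), b = factor(1:4), rep = 1:2)

# Treatment-coded dummy columns for each factor, intercept column dropped
Xa <- model.matrix(~ a, df)[, -1, drop = FALSE]
Xb <- model.matrix(~ b, df)[, -1, drop = FALSE]

# Centre each column, then take cross-products between the two factors
Xa_c <- scale(Xa, scale = FALSE)
Xb_c <- scale(Xb, scale = FALSE)
cross <- crossprod(Xa_c, Xb_c)

max(abs(cross))  # zero up to rounding: the two factors are orthogonal
```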

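Here is an R function to create categorical datasets with specifiable numbers of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:

library(car)  # provides vif()

trial <- function(n, k1=2, k2=2) {
  # All k1*k2 combinations of the two variables, coded as factors so that
  # lm() treats them as categorical rather than numeric
  df <- expand.grid(Var1=factor(1:k1), Var2=factor(1:k2))
  # Replicate every combination n times to obtain a balanced design
  df <- do.call(rbind, lapply(1:n, function(i) df))
  # The response is pure noise, unrelated to the predictors
  df$y <- rnorm(k1*k2*n)
  fit <- lm(y ~ Var1+Var2, data=df)
  vif(fit)  # car reports generalized VIFs for factors with more than 2 levels
}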

Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:

sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 = 30 combinations, 1-5 replicates each

This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line

  df <- subset(df, subset=(y < 0))

immediately before the fit line in trial. Because y is random noise symmetric about zero, this removes about half the data at random. Re-running

sapply(1:5, function(i) trial(i, 10, 3))

shows that the VIFs are no longer equal to $1$ (though they remain close to it, varying randomly from run to run). They still do not increase with more categories: sapply(1:5, function(i) trial(i, 10, 10)) produces comparable values.
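The effect of balance can also be checked without any add-on package, using the textbook definition $\text{VIF} = 1/(1 - R^2)$, where $R^2$ comes from regressing one predictor on the others. The sketch below is an illustration with assumptions of its own: it uses two binary factors coded 0/1 (so an ordinary VIF is well defined) and unbalances the design deterministically rather than at random; `vif_by_hand` is a hypothetical helper name.

```r
# VIF of Var1 = 1 / (1 - R^2), with R^2 from regressing Var1 on Var2
vif_by_hand <- function(df) {
  r2 <- summary(lm(Var1 ~ Var2, data = df))$r.squared
  1 / (1 - r2)
}

# Balanced 2 x 2 design with 3 replicates per cell
balanced <- expand.grid(Var1 = 0:1, Var2 = 0:1, rep = 1:3)
vif_balanced <- vif_by_hand(balanced)      # 1, up to rounding

# Unbalance the design by dropping two of the three (0, 0) observations
unbalanced <- balanced[-c(1, 5), ]
vif_unbalanced <- vif_by_hand(unbalanced)  # 16/15, about 1.07
```

The unbalanced VIF exceeds 1 only slightly, in line with the observation above that imbalance, not the number of categories, is what moves the VIF.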

Solved – 3 categorical IV and 1 categorical DV — what test to use

I would suggest the following high-level data analysis strategy/workflow:

Start by performing exploratory data analysis (EDA). This will give you a sense of your data set and reveal features of it that might be helpful in later steps (checking assumptions, etc.).
Perform regression analysis. Your statement that logistic regression cannot be used is incorrect; the confusion arises because the term logistic regression is often used to refer only to a model with a binary DV. Logistic regression is in fact applicable in your case, in the form of multinomial logistic regression, since your DV is of unordered categorical type. If your DV were ordered, that would be a case for ordered logistic regression. The analysis IMHO should include evaluating the model's goodness-of-fit (GoF) and other relevant metrics (see above-referenced articles as a starting point, including for information on tests, etc.).
Interpret the results of your analysis, based on your research goals and questions.
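The workflow above might look roughly like this in R. This is a sketch under assumptions rather than a prescribed procedure: the data are simulated, every variable and level name is hypothetical, and `nnet::multinom()` and `MASS::polr()` are just one common choice of fitting functions.

```r
library(nnet)  # multinom(); recommended package bundled with R
library(MASS)  # polr(); recommended package bundled with R

set.seed(7)
n <- 400
# Simulated stand-in data: three categorical IVs and one categorical DV
d <- data.frame(
  iv1 = factor(sample(c("a", "b"), n, replace = TRUE)),
  iv2 = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
  iv3 = factor(sample(c("x", "y", "z"), n, replace = TRUE)),
  dv  = factor(sample(c("red", "green", "blue"), n, replace = TRUE))
)

# Step 1 (EDA): cell counts; very sparse cells are a warning sign
ftable(xtabs(~ iv1 + iv2 + dv, data = d))

# Step 2 (regression): multinomial logistic model for an unordered DV
fit  <- multinom(dv ~ iv1 + iv2 + iv3, data = d, trace = FALSE)
null <- multinom(dv ~ 1, data = d, trace = FALSE)

# Simple fit checks: AIC against the intercept-only model, and a
# table of observed versus predicted categories
aic_cmp   <- AIC(null, fit)
confusion <- table(observed = d$dv, predicted = predict(fit))

# If the DV levels had a natural order (assumed here purely for
# illustration), an ordered logistic model would be fitted instead
d$dv_ord <- factor(d$dv, levels = c("red", "green", "blue"), ordered = TRUE)
fit_ord  <- polr(dv_ord ~ iv1 + iv2 + iv3, data = d, Hess = TRUE)
```

Step 3, interpretation, then works from `summary(fit)` or `summary(fit_ord)` in light of the research questions.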
