Solved – Test for a comparison between groups on multiple categorical variables

categorical data, chi-squared-test, hypothesis testing, multivariate analysis, r

I have a data frame made up entirely of categorical variables (around 24), each a factor with multiple levels. One column classifies every observation as High, Medium or Low. I want to test whether these groups differ statistically with respect to the other variables and their levels. A small made-up example:

Medium     Day     Category
Desk       Sun     H
Desk       Mon     L
Tabl       Sun     M
Mob        Thur    H

For example, I want to know whether Mob behaves differently in Category H than in the entire population, in H than in L, and so on.

I have tried the chi-squared test, but it only gives the overall association between a variable and Category, not differences for individual levels of the factor or of Category. I understand that a proportionality test may also be applicable.

Could someone give a few pointers about tests I could use? Basically, I want to be able to say that the distribution of Desk in H is different from the overall distribution and also different from the distribution in M and in L.

Best Answer

Let's take your first goal, which is to test for a difference in the rate of desk vs. non-desk mediums across H vs. non-H categories. If this is a valid rephrasing of your goal, then you can transform your variables accordingly and run a bivariate logistic regression. Your data are probably too sparse to run even an example model (and your code isn't copy-and-pastable), so I can't give you tested and complete syntax, but here's a dry run:

# dat is assumed to be your data frame
summary(mod <- glm( I(Medium=="Desk") ~ I(Category=="H"), binomial(), dat ))
predict(mod, data.frame(Category=c("H","NotH")), type="response")

The significance of the one predictor here will tell you whether the difference in rates of Desk is significant in category H compared to both other categories lumped together. The second line will give you the actual predicted probability of a Desk medium for an H category vs. either of the non-H ones.

If you want to know whether the category-H Desk rate differs from the Desk rate of one specific other category (let's say M), I would just run the model on a subset of the data that excludes the third category (let's say L). Assuming your dataset is named dat:

summary(mod <- glm( I(Medium=="Desk") ~ I(Category=="H"), binomial(), 
    dat, subset= Category!="L"))
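
Since the original data are too sparse to fit, here is a minimal sketch that runs both models end to end on simulated data. The data frame dat, its 300 rows, and the Desk rates of 0.6 (in H) and 0.3 (elsewhere) are invented purely for illustration; only the model formulas come from the answer above.

```r
# Hypothetical toy data (an assumption, not the asker's data):
# Desk is made more common in category H than in M or L.
set.seed(42)
n <- 300
dat <- data.frame(Category = sample(c("H", "M", "L"), n, replace = TRUE))
p_desk <- ifelse(dat$Category == "H", 0.6, 0.3)
dat$Medium <- ifelse(runif(n) < p_desk, "Desk",
                     sample(c("Tabl", "Mob"), n, replace = TRUE))

# H vs. both other categories lumped together
mod_all <- glm(I(Medium == "Desk") ~ I(Category == "H"),
               family = binomial(), data = dat)
summary(mod_all)

# Predicted probability of Desk for an H row and for a non-H row
p_hat <- predict(mod_all, data.frame(Category = c("H", "NotH")),
                 type = "response")
p_hat

# H vs. M only: drop the L rows before fitting
mod_hm <- glm(I(Medium == "Desk") ~ I(Category == "H"),
              family = binomial(), data = dat, subset = Category != "L")
summary(mod_hm)
```

With a single binary predictor, the two predicted probabilities are simply the observed Desk proportions in H and in non-H rows, so the model is mostly a convenient way to get a significance test for their difference.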

I'm addicted to regression (and, by the way, it sounds like some of your other goals might call for multinomial logistic models), so I just default to this approach. I think there is a chi-squared solution to at least some of your research questions, especially if you transform the variables first and treat the TRUEs and FALSEs as categories. Proportionality tests, however, are not relevant here, assuming that you're referring to the proportional-odds assumption, which doesn't apply to unordered or binary variables.
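
The chi-squared route mentioned above could look roughly like the sketch below: collapse each variable to TRUE/FALSE and test the resulting 2x2 table. The simulated dat and its Desk rates are assumptions made only so the code runs; substitute your own data frame.

```r
# Hypothetical toy data (an assumption, not the asker's data)
set.seed(1)
n <- 300
dat <- data.frame(Category = sample(c("H", "M", "L"), n, replace = TRUE))
dat$Medium <- ifelse(runif(n) < ifelse(dat$Category == "H", 0.6, 0.3),
                     "Desk", "Other")

# Desk vs. non-Desk crossed with H vs. non-H
tab_all <- table(Desk = dat$Medium == "Desk", H = dat$Category == "H")
tab_all
chi_all <- chisq.test(tab_all)  # Yates' correction is the default for 2x2
chi_all

# Same idea for H vs. M only: drop the L rows first
dat_hm <- dat[dat$Category != "L", ]
tab_hm <- table(Desk = dat_hm$Medium == "Desk", H = dat_hm$Category == "H")
chi_hm <- chisq.test(tab_hm)
chi_hm
```

Each call tests one specific contrast, so if you repeat this across many levels and variables you will probably want some multiple-comparison adjustment, for example via p.adjust().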
