I have a data frame of about 24 categorical variables, each a factor with multiple levels. One column classifies every observation as High, Low, or Medium. I want to understand the statistical differences between these groups in terms of the variables and their levels. A small theoretical example:
Medium  Day   Category
Desk    Sun   H
Desk    Mon   L
Tabl    Sun   M
Mob     Thur  H
Now I want to understand whether, for example, Mob behaves differently for Category H vs. the entire population, for H vs. L, and so on.
I have tried the chi-squared test, but it only gives the overall association between a variable and Category, not between individual factor levels or categories. I understand that a proportionality test may also be applicable.
Could someone give a few pointers on tests I could use? Basically, I want to be able to say that the distribution of Desk in H is different from the overall distribution, and also different from that in M and L.
Best Answer
Let's take your first goal, which is to test for a difference in the rate of desk vs. non-desk mediums across H vs. non-H categories. If this is a valid rephrasing of your goal, then you can transform your variables accordingly and run a bivariate logistic regression. Your data are probably too sparse to run even an example model (and your code isn't copy-and-pastable), so I can't give you tested and complete syntax, but here's a dry run:
The significance of the one predictor here will tell you whether the difference in rates of Desk is significant in category H compared to both other categories lumped together. The second line will give you the actual predicted probability of a Desk medium for an H category vs. either of the non-H ones.
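As a rough illustration of the comparison described above, here is a minimal Python sketch. It is not the answer's own syntax. It relies on the fact that a logistic regression with a single binary predictor is saturated, so its maximum-likelihood estimates can be read directly off the 2x2 table: the slope is the log odds ratio, and the fitted probabilities are the observed proportions. The `rows` data, the counts in it, and the function name `desk_by_h` are all hypothetical.

```python
import math

# Hypothetical (Medium, Category) rows; the counts are made up for illustration.
rows = (
    [("Desk", "H")] * 30 + [("Mob", "H")] * 20
    + [("Desk", "M")] * 15 + [("Tabl", "M")] * 25
    + [("Desk", "L")] * 10 + [("Mob", "L")] * 30
)


def desk_by_h(rows):
    """Closed-form fit of logit P(Desk) = b0 + b1 * [Category == H]."""
    # Recode both variables to binary and tabulate the 2x2 counts.
    desk_h = sum(1 for m, c in rows if m == "Desk" and c == "H")
    other_h = sum(1 for m, c in rows if m != "Desk" and c == "H")
    desk_rest = sum(1 for m, c in rows if m == "Desk" and c != "H")
    other_rest = sum(1 for m, c in rows if m != "Desk" and c != "H")

    # The model is saturated, so the ML slope is the log odds ratio.
    # Note: this breaks down if any of the four cells is zero.
    slope = math.log((desk_h * other_rest) / (other_h * desk_rest))
    se = math.sqrt(1 / desk_h + 1 / other_h + 1 / desk_rest + 1 / other_rest)
    z = slope / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided Wald test

    return {
        "p_desk_h": desk_h / (desk_h + other_h),              # fitted P(Desk | H)
        "p_desk_not_h": desk_rest / (desk_rest + other_rest),  # fitted P(Desk | not H)
        "slope": slope,
        "se": se,
        "z": z,
        "p_value": p_value,
    }


result = desk_by_h(rows)
print(result)
```

With real data, a regression routine from a statistics package should give the same slope and Wald test, and it extends more naturally once further predictors are added.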
If you want to know whether this category-H desk rate differs from the desk rate of one specific other category (let's say M), I would just run the model on a subset of the data that excludes the third category (let's say L). Assuming your dataset is named dat, that means fitting the same model to only those rows of dat where Category is not L.
I'm addicted to regression (and, by the way, it sounds like some of your other goals here might call for multinomial logistic models), so I just default to this approach. I think there is a chi-square solution to at least some of your research questions, especially if you transform the variables first and treat the trues and falses as categories. Proportionality tests, however, are not relevant here, assuming that you're referring to the proportional-odds assumption, which doesn't apply with unordered or binary variables.
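The subsetting step and the chi-square alternative mentioned above can be combined in one small sketch: drop the third category, recode Medium to Desk vs. not Desk, and run a Pearson chi-square test (no continuity correction) on the resulting 2x2 table. This is only an illustration under stated assumptions. The `rows` data and the function name `chi2_desk_between` are hypothetical, and the p-value uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math

# Hypothetical (Medium, Category) rows; the counts are made up for illustration.
rows = (
    [("Desk", "H")] * 30 + [("Mob", "H")] * 20
    + [("Desk", "M")] * 15 + [("Tabl", "M")] * 25
    + [("Desk", "L")] * 10 + [("Mob", "L")] * 30
)


def chi2_desk_between(rows, cat_a, cat_b):
    """Pearson chi-square test of Desk vs. not Desk across two categories."""
    # Keep only the two categories being compared (this drops the third).
    subset = [(m, c) for m, c in rows if c in (cat_a, cat_b)]

    # Observed 2x2 table: one row per category, columns = Desk / not Desk.
    observed = []
    for cat in (cat_a, cat_b):
        desk = sum(1 for m, c in subset if c == cat and m == "Desk")
        other = sum(1 for m, c in subset if c == cat and m != "Desk")
        observed.append([desk, other])

    n = len(subset)
    row_totals = [sum(r) for r in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    # Sum of (observed - expected)^2 / expected over the four cells.
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi2_desk_between(rows, "H", "M")
print(chi2, p_value)
```

Calling the same function with `("H", "L")` gives the other pairwise comparison. Running several such tests raises the usual multiple-comparison concern, so some adjustment of the p-values may be worth considering.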