Solved – How to test for independence with non-exclusive categorical variables

Tags: categorical data, non-independent, predictor

Introduction

I have a contingency table with many rows (one per agent) and a binary outcome, whose occurrences I count:

name  outcome1  outcome2
----  --------  --------
A     14        5       
B     17        2       
C     6         5       
D     11        8       
E     18        14

This is all fine, because so far the levels within each variable (name and outcome) are mutually exclusive, i.e. person A cannot be person B at the same time, and outcome1 does not occur at the same time as outcome2.

Adding Problems

However, I now want to enrich my data set by assigning classes to the agents.
The classes are not exclusive, and some may even depend on each other.
For the example above, with four classes Cx:

name  C1   C2   C3   C4 
----  ---  ---  ---  ---
A     0    0    1    1  
B     1    0    1    0  
C     1    1    0    1  
D     1    1    0    0  
E     1    1    1    0

I now want to find out whether the outcome of the experiment depends on any of the classes.

Possible (naïve) Solution

My idea was initially to aggregate based on the class and then perform the independence tests, so that the table would look like this:

class   outcome1  outcome2
------  --------  --------
C3      49        21
not_C3  17        13
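
For reference, the aggregation and a plain 2x2 independence test can be sketched in Python as follows. This is only an illustration of the naive approach, using the example counts above; the helper names (`aggregate`, `chi2_2x2`) are made up for this sketch, and the test is a Pearson chi-squared test without continuity correction, computed by hand so that no extra library is needed.

```python
import math

# Counts from the example: name -> (outcome1, outcome2)
counts = {"A": (14, 5), "B": (17, 2), "C": (6, 5), "D": (11, 8), "E": (18, 14)}
# Class membership from the example: name -> (C1, C2, C3, C4)
classes = {"A": (0, 0, 1, 1), "B": (1, 0, 1, 0), "C": (1, 1, 0, 1),
           "D": (1, 1, 0, 0), "E": (1, 1, 1, 0)}


def aggregate(class_index):
    """Return [[in_o1, in_o2], [out_o1, out_o2]] for one class (hypothetical helper)."""
    table = [[0, 0], [0, 0]]
    for name, (o1, o2) in counts.items():
        row = 0 if classes[name][class_index] else 1
        table[row][0] += o1
        table[row][1] += o2
    return table


def chi2_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) and p-value for 1 df."""
    row_sums = [sum(r) for r in table]
    col_sums = [sum(c) for c in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


table_c3 = aggregate(2)  # index 2 corresponds to C3
stat_c3, p_c3 = chi2_2x2(table_c3)
print(table_c3, round(stat_c3, 3), round(p_c3, 3))
```

Calling `aggregate` with index 0, 1 or 3 gives the corresponding tables for C1, C2 and C4.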

However, it then occurred to me that this method masks the influence of the other classes: because I isolate one class at a time, the results may be misleading if some of the classes depend strongly on each other.

Also, my real data set contains about 200 agents and 30 categories, so my method would give a lot of results which I do not know how to interpret.

The Question

With this in mind, I turn to you: What statistical method is applicable to test (in-)dependence on a data set with one categorical non-exclusive variable and one binary categorical variable?

I would like to get some result along the lines of "Category 1 is the strongest predictor for the outcome (p < 0.01). It also correlates with Category 2."

Solutions using Python or R are more than welcome, but I don't need code.
I need to know which method is applicable.

Best Answer

I suggest fitting a Poisson regression separately to outcome1 and outcome2 (the response variables), with the classes C1 to C4 as explanatory variables.

You say that the classes are not exclusive, but this is not a problem if you take interaction between the classes into account. You can read more about interaction in the following post: Specification and interpretation of interaction terms using glm()
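
To make the regression idea concrete, here is a minimal, dependency-free sketch of a Poisson regression of outcome1 on two of the class indicators (C1 and C3), fitted by iteratively reweighted least squares. The function names (`solve`, `fit_poisson`) are invented for this sketch; in practice one would more likely use a GLM routine from a statistics package, which also reports standard errors and p-values. An interaction term would be an extra column holding the product of the two indicators; it is left out here because, in the five-agent example, no agent lacks both C1 and C3, so that column is not separately identifiable.

```python
import math

# Per-agent data copied from the question (agents A..E).
outcome1 = [14, 17, 6, 11, 18]
c1 = [0, 1, 1, 1, 1]
c3 = [1, 1, 0, 0, 1]
# Design matrix columns: intercept, C1, C3.
X = [[1.0, float(a), float(b)] for a, b in zip(c1, c3)]


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_poisson(X, y, iterations=50):
    """Fit log(mu) = X beta by iteratively reweighted least squares."""
    p = len(X[0])
    eta = [math.log(max(v, 0.5)) for v in y]  # start from the observed counts
    beta = [0.0] * p
    for _ in range(iterations):
        mu = [math.exp(e) for e in eta]
        # Working response; for a Poisson log-link model the weights equal mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx = [[sum(m * r[j] * r[k] for r, m in zip(X, mu)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(m * r[j] * zi for r, m, zi in zip(X, mu, z)) for j in range(p)]
        beta = solve(xtwx, xtwz)
        eta = [sum(bj * xj for bj, xj in zip(beta, r)) for r in X]
    return beta, [math.exp(e) for e in eta]


beta, fitted = fit_poisson(X, outcome1)
# exp(coefficient) is the multiplicative change in the expected count.
rate_ratios = {"C1": math.exp(beta[1]), "C3": math.exp(beta[2])}
print(rate_ratios)
```

A rate ratio above 1 means that agents in that class have a higher expected outcome1 count, holding the other class fixed. With only five agents this shows the mechanics and nothing more.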

I see no way around the dependency between the classes when doing a Poisson regression. You can, however, measure the significance of the association with a chi-squared test, and its strength with Cramér's V. I do not know whether this fully answers your question.
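
As a rough illustration of the strength-of-association part, the sketch below computes Cramér's V for every pair of class indicators in the example, assuming the association of interest is the one between classes. For a 2x2 table, V reduces to sqrt(chi2 / n). The function name `cramers_v` is chosen for this sketch, and with only five agents the numbers are purely illustrative.

```python
import math
from itertools import combinations

# Class membership per agent (A..E), copied from the question.
membership = {
    "C1": [0, 1, 1, 1, 1],
    "C2": [0, 0, 1, 1, 1],
    "C3": [1, 1, 0, 0, 1],
    "C4": [1, 0, 1, 0, 0],
}


def cramers_v(x, y):
    """Cramér's V for two binary indicator lists of equal length."""
    n = len(x)
    table = [[0, 0], [0, 0]]
    for a, b in zip(x, y):
        table[a][b] += 1
    row_sums = [sum(r) for r in table]
    col_sums = [sum(c) for c in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For a 2x2 table, min(rows - 1, cols - 1) = 1, so V = sqrt(chi2 / n).
    return math.sqrt(chi2 / n)


v = {(a, b): cramers_v(membership[a], membership[b])
     for a, b in combinations(membership, 2)}
for pair, value in v.items():
    print(pair, round(value, 3))
```

Values near 0 indicate weak association between two classes and values near 1 indicate strong association; V carries no sign, so it does not say whether two classes tend to occur together or to exclude each other.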