Solved – How to test for independence with non-exclusive categorical variables

categorical data, non-independent, predictor

Introduction

I have a categorical contingency table with many rows and a binary outcome, which I count:

name  outcome1  outcome2
----  --------  --------
A     14        5       
B     17        2       
C     6         5       
D     11        8       
E     18        14

This is all fine so far, because the levels of both variables (name and outcome) are mutually exclusive, i.e. person A cannot be person B at the same time, and outcome1 does not occur at the same time as outcome2.

Adding Problems

However, I now want to enrich my data set by assigning classes to the agents.
The classes are not exclusive, and some may even depend on each other.
For the example above, with four classes Cx:

name  C1   C2   C3   C4 
----  ---  ---  ---  ---
A     0    0    1    1  
B     1    0    1    0  
C     1    1    0    1  
D     1    1    0    0  
E     1    1    1    0

I now want to find out whether the outcome of the experiment depends on any of the classes.

Possible (naïve) Solution

My idea was initially to aggregate based on the class and then perform the independence tests, so that the table would look like this:

class   outcome1  outcome2
------  --------  --------
C3      49        21
not_C3  17        13

However, it then occurred to me that I mask out the influence of the other classes with this method, because I isolate based on class, which may give me bad results if some of the classes depend strongly on each other.
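For reference, the aggregation described above can be sketched in a few lines of plain Python. The counts and class memberships are copied from the two example tables; the names `counts`, `membership` and `collapse` are only illustrative. It reproduces the C3 / not_C3 table, and it also shows the weakness: every agent feeds into the table of each class it belongs to, so the per-class tables are not independent of each other.

```python
# Counts per agent: (outcome1, outcome2), copied from the first table.
counts = {"A": (14, 5), "B": (17, 2), "C": (6, 5), "D": (11, 8), "E": (18, 14)}

# Non-exclusive class membership per agent, copied from the second table.
membership = {
    "A": {"C3", "C4"},
    "B": {"C1", "C3"},
    "C": {"C1", "C2", "C4"},
    "D": {"C1", "C2"},
    "E": {"C1", "C2", "C3"},
}


def collapse(cls):
    """Return [[in_outcome1, in_outcome2], [out_outcome1, out_outcome2]] for one class."""
    table = [[0, 0], [0, 0]]
    for name, (o1, o2) in counts.items():
        row = 0 if cls in membership[name] else 1  # row 0: in class, row 1: not in class
        table[row][0] += o1
        table[row][1] += o2
    return table


# One 2 x 2 table per class; the same agents reappear in several tables.
tables = {cls: collapse(cls) for cls in ("C1", "C2", "C3", "C4")}
for cls, (inside, outside) in tables.items():
    print(cls, inside, "not_" + cls, outside)
```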

Also, my real data set contains about 200 agents and 30 categories, so my method would give a lot of results which I do not know how to interpret.

The Question

With this in mind, I turn to you: What statistical method is applicable to test (in-)dependence on a data set with one categorical non-exclusive variable and one binary categorical variable?

I would like to get some result along the lines of "Category 1 is the strongest predictor for the outcome (p < 0.01). It also correlates with Category 2."

Solutions using Python or R are more than welcome, but I don't need code.
I need to know which method is applicable.

Best Answer

I suggest running a Poisson regression separately on outcome1 and outcome2 (as response variables), with class1, class2, class3 and class4 as explanatory variables.

You say that the classes are not exclusive, but this is not a problem if you take interaction between the classes into account. You can read more about interaction in the following post: Specification and interpretation of interaction terms using glm()

As for the dependency between the classes themselves, I see no way around it within a Poisson regression. You can measure the significance of the association with a chi-squared test, and the strength of the association with Cramér's V. I do not know whether this fully answers your question.
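To make the chi-squared test and Cramér's V concrete, here is a minimal sketch in plain Python for the C3 / not_C3 table from the question. The helper names are made up for illustration, no continuity correction is applied, and the p-value shortcut via `math.erfc` only holds for one degree of freedom, i.e. a 2 x 2 table.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic (no continuity correction) and total count."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n  # expected count under independence
            stat += (observed - expected) ** 2 / expected
    return stat, n


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    stat, n = chi_squared(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(stat / (n * k))


def p_value_2x2(stat):
    """Upper-tail p-value for 1 degree of freedom (2 x 2 tables only)."""
    return math.erfc(math.sqrt(stat / 2))


c3_table = [[49, 21], [17, 13]]  # C3 vs. not_C3, from the question
stat, n = chi_squared(c3_table)
v = cramers_v(c3_table)
p = p_value_2x2(stat)
print(f"chi2 = {stat:.3f}, p = {p:.3f}, Cramer's V = {v:.3f}")
```

A library routine such as `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2 x 2 tables by default, so its statistic will differ somewhat from this uncorrected sketch.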

Related Solutions

Feature Engineering – Principled Approach to Collapsing Categorical Variables

If I understood correctly, you imagine a linear model where one of the predictors is categorical (e.g. college major); and you expect that for some subgroups of its levels (subgroups of categories) the coefficients might be exactly the same. So perhaps the regression coefficients for Maths and Physics are the same, but different from those for Chemistry and Biology.

In the simplest case, you would have a "one way ANOVA" linear model with a single categorical predictor: $$y_{ij} = \mu + \alpha_i + \epsilon_{ij},$$ where $i$ encodes the level of the categorical variable (the category). But you might prefer a solution that collapses some levels (categories) together, e.g. $$\begin{cases}\alpha_1=\alpha_2, \\ \alpha_3=\alpha_4=\alpha_5.\end{cases}$$

This suggests that one can try to use a regularization penalty that would penalize solutions with differing alphas. One penalty term that immediately comes to mind is $$L=\omega \sum_{i<j}|\alpha_i-\alpha_j|.$$ This resembles lasso and should enforce sparsity of the $\alpha_i-\alpha_j$ differences, which is exactly what you want: you want many of them to be zero. Regularization parameter $\omega$ should be selected with cross-validation.
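A small sketch of how this penalty behaves, in plain Python. The coefficient vectors are invented purely for illustration, and `fusion_penalty` is a hypothetical helper, not part of any library. The collapsed solution with $\alpha_1=\alpha_2$ and $\alpha_3=\alpha_4=\alpha_5$ has four pairwise differences equal to zero and therefore pays a smaller penalty than a nearby solution with five distinct values.

```python
from itertools import combinations


def fusion_penalty(alphas, omega):
    """omega * sum over all pairs i < j of |alpha_i - alpha_j|."""
    return omega * sum(abs(a - b) for a, b in combinations(alphas, 2))


def zero_differences(alphas):
    """Number of pairs of levels whose coefficients are exactly equal (fused)."""
    return sum(1 for a, b in combinations(alphas, 2) if a == b)


# Hypothetical coefficients for five levels of one categorical predictor.
distinct = [0.9, 1.1, -0.4, -0.5, -0.6]  # every level has its own value
fused = [1.0, 1.0, -0.5, -0.5, -0.5]     # levels 1-2 and 3-5 collapsed

omega = 1.0  # regularization strength; in practice chosen by cross-validation
print(fusion_penalty(distinct, omega), zero_differences(distinct))
print(fusion_penalty(fused, omega), zero_differences(fused))
```

This only evaluates the penalty term; actually fitting the model means minimizing the residual sum of squares plus this term, which needs a proper optimizer.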

I have never dealt with models like that and the above is the first thing that came to my mind. Then I decided to see if there is something like that implemented. I made some google searches and soon realized that this is called fusion of categories; searching for lasso fusion categorical will give you a lot of references to read. Here are a few that I briefly looked at:

Gerhard Tutz, Regression for Categorical Data, see pp. 175-175 in Google Books. Tutz mentions the following four papers:
Land and Friedman, 1997, Variable fusion: a new adaptive signal regression method
Bondell and Reich, 2009, Simultaneous factor selection and collapsing levels in ANOVA
Gertheiss and Tutz, 2010, Sparse modeling of categorial explanatory variables
Tibshirani et al. 2005, Sparsity and smoothness via the fused lasso is somewhat relevant even if not exactly the same (it is about ordinal variables)

Gertheiss and Tutz 2010, published in the Annals of Applied Statistics, looks like a recent and very readable paper that contains other references. Here is its abstract:

Shrinking methods in regression analysis are usually designed for metric predictors. In this article, however, shrinkage methods for categorial predictors are proposed. As an application we consider data from the Munich rent standard, where, for example, urban districts are treated as a categorial predictor. If independent variables are categorial, some modifications to usual shrinking procedures are necessary. Two $L_1$-penalty based methods for factor selection and clustering of categories are presented and investigated. The first approach is designed for nominal scale levels, the second one for ordinal predictors. Besides applying them to the Munich rent standard, methods are illustrated and compared in simulation studies.

I like their lasso-like solution paths, which show how levels of two categorical variables get merged together when regularization strength increases.

Solved – Test for a comparison between groups on multiple categorical variables

Let's take your first goal, which is to test for a difference in the rate of desk vs. non-desk mediums across H vs. non-H categories. If this is a valid rephrasing of your goal, then you can transform your variables accordingly and run a bivariate logistic regression. Your data are probably too sparse to run even an example model (and your code isn't copy-and-pastable), so I can't give you tested and complete syntax, but here's a dry run:

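# Logistic regression of "medium is Desk" on "category is H"
summary(mod <- glm(I(Medium == "Desk") ~ I(Category == "H"), family = binomial()))
# Predicted probability of a Desk medium for category H vs. any non-H category
predict(mod, newdata = data.frame(Category = c("H", "NotH")), type = "response")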

The significance of the one predictor here will tell you whether the difference in rates of Desk is significant in category H compared to both other categories lumped together. The second line will give you the actual predicted probability of a Desk medium for an H category vs. either of the non-H ones.
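With a single binary predictor, the fit can also be checked by hand, because the model is saturated: the predicted probabilities are just the observed Desk proportions in each group, and the coefficient is the log odds ratio. The sketch below does this in plain Python. The counts are invented for illustration and are not from the original post.

```python
import math

# Hypothetical counts, NOT from the original post.
desk_h, other_h = 12, 8           # category H: Desk / not Desk
desk_not_h, other_not_h = 9, 21   # all other categories: Desk / not Desk


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


p_h = desk_h / (desk_h + other_h)
p_not_h = desk_not_h / (desk_not_h + other_not_h)

# Maximum-likelihood estimates for a logistic model with one binary predictor.
intercept = logit(p_not_h)            # log-odds of Desk outside H
slope = logit(p_h) - logit(p_not_h)   # log odds ratio, H vs. not-H

# Wald standard error of the log odds ratio and its two-sided p-value.
se = math.sqrt(1 / desk_h + 1 / other_h + 1 / desk_not_h + 1 / other_not_h)
z = slope / se
p_value = math.erfc(abs(z) / math.sqrt(2))

# The fitted probabilities reproduce the observed proportions.
print(inv_logit(intercept + slope), inv_logit(intercept))
print(f"log odds ratio = {slope:.3f}, z = {z:.3f}, p = {p_value:.3f}")
```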

If you want to know whether this category-H desk rate is different from the desk rate of one specific other category (let's say M), I would just run the model on a subset of the data that doesn't include the third category (let's say L). Assuming your dataset is named dat:

summary(mod <- glm(I(Medium == "Desk") ~ I(Category == "H"), family = binomial(),
    data = dat, subset = Category != "L"))

I'm addicted to regression (and it sounds like some of your other goals here might call for multinomial logistic models by the way) so I just default to this approach; I think there is a Chi-square solution to at least some of your research questions, especially if you transform the variables first and treat the trues and falses as categories. Proportionality tests, however, are not relevant here, assuming that you're referring to the proportional-odds assumption, which doesn't apply with unordered or binary variables.