Clustering Categorical Data – Independence of Variables

Tags: categorical-data, clustering, independence

When trying to explain cluster analyses, it is common for people to misunderstand the process as being related to whether the variables are correlated. One way to get people past that confusion is a plot like this:

[figure: panel of scatterplots contrasting the presence of clusters with the presence of correlation between the variables]

This clearly displays the difference between the question of whether there are clusters and the question of whether the variables are related. However, this only illustrates the distinction for continuous data. I'm having trouble thinking of an analog with categorical data:

ID  property.A  property.B
1   yes         yes
2   yes         yes
3   yes         yes
4   yes         yes
5   no          no
6   no          no
7   no          no
8   no          no

We can see that there are two clear clusters: people with both property A and B, and those with neither. However, if we look at the variables (e.g., with a chi-squared test), they are clearly related:

# build the 2x2 table for the 8 cases listed above
A <- factor(rep(c("yes", "no"), each = 4), levels = c("yes", "no"))
B <- factor(rep(c("yes", "no"), each = 4), levels = c("yes", "no"))
tab <- table(A, B)
tab
#      B
# A     yes no
#   yes   4  0
#   no    0  4
chisq.test(tab)
# X-squared = 4.5, df = 1, p-value = 0.03389

I find I am at a loss for how to construct an example with categorical data that is analogous to the one with continuous data above. Is it even possible to have clusters in purely categorical data without the variables being related as well? What if the variables have more than two levels, or if there are larger numbers of variables? If the clustering of observations does necessarily entail relationships between the variables and vice versa, does that imply that clustering is not really worth doing when you only have categorical data (i.e., should you just analyze the variables instead)?


Update: I left a lot out of the original question because I wanted to just focus on the idea that a simple example could be created that would be immediately intuitive even to someone who was largely unfamiliar with cluster analyses. However, I recognize that a lot of clustering is contingent on choices of distances and algorithms, etc. It may help if I specify more.

I recognize that Pearson's correlation is really only appropriate for continuous data. For the categorical data, we could think of a chi-squared test (for a two-way contingency table) or a log-linear model (for multi-way contingency tables) as a way to assess the independence of the categorical variables.
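
For illustration (a rough sketch with made-up data, not my actual dataset), the mutual-independence test for three nominal variables could be run in R with a main-effects-only log-linear model:

set.seed(1)
d3 <- data.frame(A = sample(c("yes", "no"), 200, replace = TRUE),
                 B = sample(c("yes", "no"), 200, replace = TRUE),
                 C = sample(c("low", "mid", "high"), 200, replace = TRUE))
# fit the main-effects-only (mutual independence) log-linear model
fit <- loglin(table(d3), margin = list(1, 2, 3))
fit$lrt; fit$pearson; fit$df
# large statistics relative to df would indicate the variables are associated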

For an algorithm, we could imagine using k-medoids / PAM, which can be applied to both the continuous situation and the categorical data. (Note that part of the intention behind the continuous example is that any reasonable clustering algorithm should be able to detect those clusters; if one could not, it should be possible to construct a more extreme example.)

Regarding the notion of distance, I assumed Euclidean distance for the continuous example, because it would be the most basic for a naive viewer. I suppose the analogous distance for categorical data (in that it would be the most immediately intuitive) would be simple matching. However, I am open to discussion of other distances if that leads to a solution or just an interesting discussion.
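
To make that concrete, here is a rough sketch using R's cluster package (for factor variables, daisy()'s Gower coefficient reduces to simple matching per variable), showing that PAM recovers the two intuitive clusters in the toy data above:

library(cluster)  # daisy() and pam()
dat <- data.frame(property.A = factor(rep(c("yes", "no"), each = 4)),
                  property.B = factor(rep(c("yes", "no"), each = 4)))
d <- daisy(dat, metric = "gower")  # simple-matching dissimilarity here
pam(d, k = 2)$clustering
# rows 1-4 form one cluster and rows 5-8 the other, as expected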

Best Answer

Consider the clear-cluster case with uncorrelated scale variables, such as in the top-right picture in the question, and categorize its data.

[figure: the categorized data as a cross-table of X by Y, with spot size showing cell frequency]

We subdivided the scale range of both variables X and Y into 3 bins, which from now on we treat as categorical labels. Moreover, we declare them nominal rather than ordinal, because the question asked is implicitly and primarily about qualitative data. The size of a spot is the frequency in the corresponding cell of the frequency cross-table; all cases within the same cell are considered identical.

Intuitively and most generally, "clusters" are defined as clots of data points separated by sparse regions in the data "space". That was the impression with the original scale data, and it remains the impression in the cross-tabulation of the categorized data. X and Y are now categorical, but they still look uncorrelated: the chi-square association is very near zero. And the clusters are still there.
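
To make the picture concrete, here is a hypothetical cross-table of the same shape (the counts are mine, chosen only for illustration): four heavy corner cells separated by a sparse middle row and column. Because the table is exactly the outer product of its margins, the chi-square association is exactly zero, yet the four clumps are plainly visible.

tab2 <- outer(c(5, 1, 5), c(5, 1, 5))  # table = product of its margins
dimnames(tab2) <- list(X = c("a", "b", "c"), Y = c("1", "2", "3"))
tab2
#    Y
# X    1  2  3
#   a 25  5 25
#   b  5  1  5
#   c 25  5 25
chisq.test(tab2)
# X-squared = 0, df = 4: no association at all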

But recall that we are dealing with nominal categories, whose order in the table is arbitrary. We may reorder whole rows and/or columns as we like without affecting the observed chi-square value. Let us do such a reordering...

[figure: the same cross-table with rows and columns reordered]

...and find that the clusters have simply disappeared. The four cells a1, a3, c1 and c3 could just as well be united into a single cluster. So no, we really do not have any clusters in the categorical data.
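
Using the hypothetical table from above, the same reordering can be done in code: moving row b and column 2 to the front leaves chi-square untouched, while the four heavy cells become adjacent and merge into one undifferentiated block.

tab2[c("b", "a", "c"), c("2", "1", "3")]
#    Y
# X    2  1  3
#   b  1  5  5
#   a  5 25 25
#   c  5 25 25
chisq.test(tab2[c("b", "a", "c"), c("2", "1", "3")])$statistic
# still 0 -- reordering nominal categories cannot change the association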

Cases in cells a1 and c3 (and likewise in a3 and c1) are completely dissimilar: they share no attributes. To induce clusters in our data, with a1 and c3 forming the clusters, we have to empty the confounding cells a3 and c1 to a large extent, by dropping those cases from the dataset.

[figure: the cross-table with cells a3 and c1 emptied, leaving a1 and c3 heavy]

Now clusters do exist, but at the same time we have lost the uncorrelatedness: the diagonal structure showing up in the table signals that the chi-square statistic has moved far from zero.
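
With the hypothetical counts, emptying the two confounding corner cells shows the same effect numerically: two clean clumps remain (a1 and c3), but the chi-square statistic jumps away from zero.

tab3 <- tab2
tab3["a", "3"] <- 0   # drop the cases of cell a3
tab3["c", "1"] <- 0   # drop the cases of cell c1
chisq.test(tab3)$statistic
# roughly 50 on 4 df -- a strong diagonal association has appeared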

A pity. Let us try to preserve uncorrelatedness and more or less clear clusters at the same time. We might decide to sufficiently empty just cell a3, for example, and then consider a1+c1 as one cluster opposing cluster c3:

[figure: the cross-table with cell a3 emptied; a1+c1 opposing c3]

That operation did not move the chi-square far from zero.

[Indeed, a table such as, for example,
 6   6   1
 6   6   1
 1   1   0
retains about the same, very low chi-square association after
dividing its 2nd column by 3 and multiplying its 2nd row by 3, which gives
 6   2   1
18   6   3
 1  1/3  0
Cell (1,2) now has a three times lower frequency. We had, however, to raise
the frequency of cell (2,1) threefold in order to keep chi-square almost as before.]
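
(A quick numeric check of that note, with the counts as given:

t1 <- matrix(c(6, 6, 1,
               6, 6, 1,
               1, 1, 0), nrow = 3, byrow = TRUE)
t2 <- matrix(c(6, 2,   1,
              18, 6,   3,
               1, 1/3, 0), nrow = 3, byrow = TRUE)
chisq.test(t1)$statistic
chisq.test(t2)$statistic
# both are tiny and nearly equal (about 0.17), so the association stays very low.)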

But the situation with the clusters has become confused. Cluster a1+c1 contains cases that are partly identical and partly half-dissimilar. That a cluster is relatively low in homogeneity is not, by itself, an obstacle to a clear-cluster structure in a dataset. The problem with our categorical data, however, is that cluster a1+c1 is in no way better than cluster c1+c3, its symmetric analogue. That means the cluster solution is unstable: it will depend on the case order in the dataset. An unstable solution, even a relatively "clear-clustered" one, is a bad, unreliable solution.

The only way to overcome the problem and make the solution both clear and stable is to untie cell c3 from cell c1 by moving its data to cell b3 (or to b2).

[figure: the cross-table with the c3 cases moved to b3; clusters a1+c1 vs b3]

So we have clear clusters: a1+c1 vs b3. But look: the diagonal pattern shows up again, and the chi-square of the table rises high above zero.
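
Hypothetical counts for this final configuration show the effect: cells a1, c1 and b3 heavy, the rest sparse; the clusters are clear, but the chi-square is large.

tab5 <- matrix(c(25, 5,  0,
                  5, 1, 25,
                 25, 5,  0),
               nrow = 3, byrow = TRUE,
               dimnames = list(X = c("a", "b", "c"), Y = c("1", "2", "3")))
chisq.test(tab5)$statistic
# roughly 67 on 4 df -- far above zero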

Conclusion. It is impossible to have two chi-square-unassociated nominal variables and good clusters of the data cases at the same time. Producing clear and stable clusters necessarily induces association between the variables.

It is also clear that if the association is present - i.e., a diagonal pattern exists or is achievable by reordering - then clusters must exist. This is because the nature of categorical data ("all or nothing") does not permit half-tones or borderline conditions; therefore a picture like the bottom-left one in the OP's question cannot emerge with categorical, nominal data.

I surmise that as we add more and more nominal variables (instead of just two) that are bivariately chi-square unrelated, we come closer to the possibility of having clusters. But zero multivariate chi-square, I expect, will still be incompatible with clusters. That remains to be shown (not by me, or not this time).


Finally, a remark on @Bey's (aka user75138) answer, which I partly supported. I commented on it, agreeing that one has first to decide on the distance metric and the association measure before one can pose the question "is variable association independent of case clusters?". This is because no universal association measure exists, nor is there a universal statistical definition of clusters. I would further add that one must also decide on the clustering technique, since different clustering methods define differently the "clusters" they are after. So the whole statement might be true.

That said, the weakness of such a dictum is that it is too broad. One should attempt to show concretely whether and where a choice of distance metric / association measure / clustering method opens room to reconcile uncorrelatedness with clusteredness for nominal data. One should keep in mind, in particular, that not all of the many proximity coefficients for binary data make sense for nominal data, since with nominal data "both cases lack this attribute" can never be grounds for their similarity.
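
A tiny illustration of that last point, with one dummy-coded nominal variable of four categories and two cases falling into different categories:

x1 <- c(1, 0, 0, 0)   # case 1 is category "a"
x2 <- c(0, 1, 0, 0)   # case 2 is category "b"
mean(x1 == x2)                # simple matching on the dummies: 0.5,
                              # the two joint absences count as "agreement"
sum(x1 & x2) / sum(x1 | x2)   # Jaccard: 0, joint absences are ignored
# the cases share nothing on this variable, so 0 is the sensible value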


Update, reporting my simulation findings.

Repeatedly, 2- or 3-variable nominal datasets were randomly generated, with the number of categories per variable varying from 3 to 5 and the total sample size varying from 300 to 600. Bivariate chi-square association was very low in all the generated datasets (Cramér's V almost never above $.1$). Also, for the 3-variable data the 3-way chi-square association (main-effects multinomial model), both Pearson and log-likelihood, was low and never significant.
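
For reproducibility, a minimal sketch in R (my own illustration, not the exact generation code) of that kind of data and of the bivariate check:

set.seed(1)
n <- 500
dat <- data.frame(V1 = factor(sample(letters[1:4], n, replace = TRUE)),
                  V2 = factor(sample(letters[1:3], n, replace = TRUE)),
                  V3 = factor(sample(letters[1:5], n, replace = TRUE)))
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi / (sum(tab) * (min(dim(tab)) - 1)))
}
cramers_v(dat$V1, dat$V2)   # typically well below 0.1 for independent draws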

Two methods of cluster analysis were used to cluster the cases of each generated dataset: hierarchical clustering (complete linkage, Dice similarity measure) and TwoStep clustering (based on the log-likelihood distance). A range of cluster solutions (varying in the number of clusters) from each analysis was then checked by several internal clustering criteria (silhouette statistic, point-biserial $r$, AIC and BIC) in search of a relatively "good" solution indicating the presence of clear clusters. A promising solution was then tested for stability by permuting the case order in the dataset and redoing the clustering.

The findings generally support the reasoning laid out above in the answer. There were never very clear clusters (such as might occur if the chi-square association were strong), and the results of the different clustering criteria often contradicted each other (which is unlikely when clusters are really clear).

Sometimes hierarchical clustering would offer a k-cluster solution that is somewhat good, as judged by a clustering criterion plot; however, testing it for stability would fail to show that it is stable. For example, this 3-variable 4x4x3 dataset

   V1  V2  V3   Count
    1   1   1   21
            2   24
            3   1
        2   1   22
            2   26
            3   1
        3   1   1
            2   1
            3   1
        4   1   17
            2   20
            3   1
    2   1   1   10
            2   12
            3   1
        2   1   10
            2   12
            3   1
        3   1   1
            2   1
            3   1
        4   1   8
            2   9
            3   1
    3   1   1   24
            2   28
            3   1
        2   1   25
            2   30
            3   1
        3   1   1
            2   1
            3   1
        4   1   19
            2   23
            3   1
    4   1   1   24
            2   28
            3   1
        2   1   26
            2   30
            3   1
        3   1   1
            2   1
            3   1
        4   1   19
            2   23
            3   1

when clustered by the complete-linkage hierarchical method with Dice similarity, seems to split - quite reasonably - into 9 clusters, in this case with agreement among the three internal validity criteria:

[figure: clustering criterion plots pointing to a 9-cluster solution]

But the solution is not stable, as seen from the confusion matrix of the original solution against the permuted (case-reordered) solution, which is not perfectly sparse (the clusters do not map one-to-one):

[figure: confusion matrix of the original vs the permuted 9-cluster solution]

If the solution had been stable (as it likely would have been had the data been continuous), we would have chosen the 9-cluster solution as a sufficiently persuasive one.
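
For anyone who wants to replay the idea of that stability check, here is a rough R sketch (not my actual SPSS workflow; I use daisy()'s Gower/simple-matching dissimilarity as a stand-in for Dice on dummy-coded data, and the data name in the last comment is only a placeholder):

library(cluster)   # daisy()

stability_check <- function(dat, k) {
  d1 <- daisy(dat, metric = "gower")             # simple matching for factors
  c1 <- cutree(hclust(d1, method = "complete"), k)
  perm <- sample(nrow(dat))                      # random case reordering
  d2 <- daisy(dat[perm, ], metric = "gower")
  c2 <- cutree(hclust(d2, method = "complete"), k)
  # cross-tabulate the two labelings for the same cases; a stable solution
  # gives one nonzero cell per row and per column
  table(original = c1[perm], permuted = c2)
}
# e.g. stability_check(case_level_data, k = 9), where case_level_data is the
# 4x4x3 count table above expanded to one row per case

With categorical data, tied distances are everywhere, so complete linkage can genuinely merge in a different order after the cases are shuffled; that is exactly the instability the confusion matrix above reveals.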

Clustering based on the log-likelihood distance (as opposed to Dice similarity) may give stable and "not bad" (internally quite valid) solutions. But that is because the distance, at least as implemented in SPSS's TwoStep cluster, encourages and fosters highly populated clusters while neglecting sparsely populated ones; it does not demand that clusters with very low frequencies be internally dense. (That seems to be the "policy" of TwoStep cluster analysis, which was designed specifically for big data and to give few clusters; small clusters are treated almost as outliers.) For example, these 2-variable data

[figure: frequency cross-table of the two variables (12 cells), four of them heavily populated]

would be combined by TwoStep into 5 clusters as shown, stably, and the 5-cluster solution is not at all bad as judged by some clustering criteria. That is because the four populated clusters are very dense inside (in fact, all cases within each are identical), and only the fifth cluster, which contains few cases, has extremely high entropy. The truly apparent solution is the 12-cluster one, not the 5-cluster one, but 12 is the total number of cells in the frequency table, which as a "cluster solution" is trivial and uninteresting.