Solved – Clustering binary categorical data

categorical dataclusteringcorrespondence-analysispca

I have some data where I have certain classes (c1, c2, c3, c4 …) and the data comprises of binary vectors where 1 and 0 denote that an entry belongs to a class or not. The number of classes will be > 200. The data will look like this:

c1    c2    c3    c4    ...
 1     0     0     0    ...
 0     1     1     0

Would this data come under "Categorical" type?

Details:

  • Sample Size : ~20000

  • No. of classes : 300

  • Data Matrix Sparsity : 99.52%

  • Problem Statement: The classes that I am talking about are medical services provided by Hospitals. If a hospital provides the service we just put 1 or else 0 in the binary vector. I want to cluster similar hospitals on the basis of their services.

I tried out PCA for dimension reduction on this dataset and I even got good clusters with DBSCAN but I read that for categorical sparse data PCA is not recommended and also Euclidian distance as the distance measure is not good.

I am planning to use MCA (Multiple Correspondence Analysis) but I cannot figure out how am I supposed to represent the data for that.

Best Answer

A simple approach is to fit a mixture of "Naive Bayes" models using EM. The structure of the mixture model is $P(x_{i1},\ldots,x_{in}) = \sum_k P(y_i=k) \prod_j P(x_{ij}|y_i=k)$. Here, $i$ indexes the data points, each of which is a vector of $n$ binary features. $y_i$ is the index of the cluster to which data point $i$ belongs. $P(y_i=k)$ is the (learned) probability of a point being generated by cluster $k$. $P(x_{ij}|y_i=k)$ is the (learned) probability of generating the value of feature $j$ for points belonging to class $k$.

This model treats the binary values in each cluster as independent conditioned on their membership in the cluster. This is the discrete analogue of fitting a (diagonal) Gaussian mixture model. My former student Tony Fountain and I applied this kind of model to cluster patterns of die failure on silicon wafers.

This model is an instance of what is known variously as Latent Class Analysis or Latent Trait Analysis. A good overview of these techniques can be found at this website by John Uebersax in which he discusses a variety of Latent Class models including Probit Discrete Latent Trait Models. The Probit model uses a latent multivariate Gaussian distribution to model each cluster, which can capture pairwise correlations among the binary responses within the cluster. I believe Uebersax provides a software package, but I have not tried it.

Related Question