Solved – Clustering a binary matrix

Tags: binary-data, clustering, r

I have a semi-small matrix of binary features, of dimension 250k x 100. Each row is a user and the columns are binary "tags" of some user behavior, e.g. "likes_cats".

user  1   2   3   4   5  ...
-------------------------
A     1   0   1   0   1
B     0   1   0   1   0
C     1   0   0   1   0

I would like to fit the users into 5-10 clusters and analyze the loadings to see if I can interpret groups of user behavior. There appear to be quite a few approaches to fitting clusters on binary data – what do you think might be the best strategy for this data?

  • PCA

  • Making a Jaccard Similarity matrix, fitting a hierarchical cluster and then using the top "nodes".

  • K-medians

  • K-medoids

  • Proximus?

  • Agnes

So far I've had some success with hierarchical clustering, but I'm really not sure it's the best way to go.

tags <- read.csv("~/tags.csv")

# Jaccard-style distance for 0/1 data
d <- dist(tags, method = "binary")

# "ward" was renamed in recent versions of R; use "ward.D2"
hc <- hclust(d, method = "ward.D2")
plot(hc)

# mean tag prevalence within each of 6 clusters
cluster.means <- aggregate(tags, by = list(cluster = cutree(hc, k = 6)), mean)
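For comparison, here is a hypothetical sketch of the k-medoids option from the list above, using `cluster::pam()` on the same binary (Jaccard) distance. The toy matrix below stands in for `tags`; for the full 250k-row data a complete distance matrix will not fit in memory, so `cluster::clara()` (PAM on subsamples) is the practical route.

```r
library(cluster)

# toy 0/1 matrix standing in for the real `tags` data
set.seed(42)
toy <- as.data.frame(matrix(rbinom(50 * 10, 1, 0.3), nrow = 50))

# exact PAM (k-medoids) on a Jaccard-style distance
d  <- dist(toy, method = "binary")
pm <- pam(d, k = 6, diss = TRUE)
table(pm$clustering)

# for 250k rows, clara() avoids the O(n^2) distance matrix
# (metric = "jaccard" needs cluster >= 2.1.0)
cl <- clara(toy, k = 6, metric = "jaccard", samples = 20)
```

Unlike hierarchical clustering, the medoids themselves are real users, which can make the clusters easier to interpret.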


Best Answer

Latent class analysis is one possible approach.

Take the following probability distribution where A, B, and C can take on values of 1 or 0.

$P(A_i, B_j, C_k)$

If these were independent of each other, then we would expect to see:

$P(A_i, B_j, C_k)=P(A_i)P(B_j)P(C_k)$
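A quick way to check this in practice (a hypothetical sketch with simulated data; substitute three columns of your tag matrix) is to compare the observed joint frequencies against the product of the marginals:

```r
# simulate three binary variables that depend on a hidden class X,
# so the independence hypothesis should fail
set.seed(1)
X <- rbinom(1000, 1, 0.5)
A <- rbinom(1000, 1, ifelse(X == 1, 0.8, 0.2))
B <- rbinom(1000, 1, ifelse(X == 1, 0.7, 0.3))
C <- rbinom(1000, 1, ifelse(X == 1, 0.6, 0.4))

# observed joint distribution P(A, B, C)
observed <- table(A, B, C) / 1000

# product of marginals P(A) P(B) P(C)
expected <- outer(outer(table(A) / 1000, table(B) / 1000), table(C) / 1000)

# large discrepancies indicate dependence among the tags
round(observed - array(expected, dim = c(2, 2, 2)), 3)
```

If the discrepancies are near zero, there is no dependency structure for a latent class to explain.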

Once this possibility is eliminated, we might hypothesize that any observed dependency is due to values clustering within otherwise unobserved subgroups. To test this idea, we can estimate the following model:

$P(A_i, B_j, C_k)=\sum_{n} P(X_n)\,P(A_i|X_n)\,P(B_j|X_n)\,P(C_k|X_n)$

where $X$ is a latent categorical variable with $n$ levels, and the variables are conditionally independent within each level. You specify $n$, and the model parameters (marginal probabilities of class membership, and class-specific probabilities for each variable) can be estimated via expectation-maximization.

In practice, you could estimate several models, with $5 \le n \le 10$, and "choose" the best model based on theory, likelihood based fit indices, and classification quality (which can be assessed by calculating posterior probabilities of class membership for the observations).
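As a hedged sketch of that workflow, here is one way to do it in R with the poLCA package (one common choice, not the only one). Note that poLCA expects categories coded starting at 1, so 0/1 tags must be recoded to 1/2; the toy data below is simulated for illustration.

```r
library(poLCA)

# simulated 1/2-coded tags standing in for the real data
set.seed(7)
toy <- as.data.frame(matrix(rbinom(500 * 6, 1, 0.4), nrow = 500)) + 1
names(toy) <- paste0("tag", 1:6)

f <- cbind(tag1, tag2, tag3, tag4, tag5, tag6) ~ 1

# fit a range of class counts (2..4 here; 5..10 on the real data)
# and compare likelihood-based fit indices such as BIC
fits <- lapply(2:4, function(n) poLCA(f, toy, nclass = n, verbose = FALSE))
sapply(fits, function(m) m$bic)

# posterior class-membership probabilities, for assessing
# classification quality
head(fits[[1]]$posterior)
```

Low BIC plus posteriors that are mostly near 0 or 1 would indicate both good fit and clean classification.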

However, trying to identify meaningful patterns in 100 variables with 5-10 groups will likely require reducing that list down prior to estimating the model, which is a tricky enough topic in its own right (REF).
