Solved – Clustering a binary matrix

Tags: binary-data, clustering, r

I have a semi-small matrix of binary features, of dimension 250k x 100. Each row is a user and the columns are binary "tags" of some user behavior, e.g. "likes_cats".

user  1   2   3   4   5  ...
-------------------------
A     1   0   1   0   1
B     0   1   0   1   0
C     1   0   0   1   0

I would like to fit the users into 5-10 clusters and analyze the loadings to see if I can interpret groups of user behavior. There appear to be quite a few approaches to fitting clusters on binary data – what do you think might be the best strategy for this data?

  • PCA

  • Making a Jaccard Similarity matrix, fitting a hierarchical cluster and then using the top "nodes".

  • K-medians

  • K-medoids

  • Proximus?

  • Agnes

So far I've had some success with hierarchical clustering, but I'm really not sure it's the best way to go.

tags <- read.csv("~/tags.csv")

# Jaccard-style distance for 0/1 data
d <- dist(tags, method = "binary")

# "ward" was renamed in recent versions of R; use "ward.D2"
hc <- hclust(d, method = "ward.D2")
plot(hc)

# mean tag prevalence within each of 6 clusters
cluster.means <- aggregate(tags, by = list(cluster = cutree(hc, k = 6)), mean)
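For comparison, here is a hypothetical sketch of the k-medoids option from the list above, using `cluster::pam()` on the same binary (Jaccard) distance. The toy matrix below stands in for `tags`; for the full 250k-row data a complete distance matrix will not fit in memory, so `cluster::clara()` (PAM on subsamples) is the practical route.

```r
library(cluster)

# toy 0/1 matrix standing in for the real `tags` data
set.seed(42)
toy <- as.data.frame(matrix(rbinom(50 * 10, 1, 0.3), nrow = 50))

# exact PAM (k-medoids) on a Jaccard-style distance
d  <- dist(toy, method = "binary")
pm <- pam(d, k = 6, diss = TRUE)
table(pm$clustering)

# for 250k rows, clara() avoids the O(n^2) distance matrix
# (metric = "jaccard" needs cluster >= 2.1.0)
cl <- clara(toy, k = 6, metric = "jaccard", samples = 20)
```

Unlike hierarchical clustering, the medoids themselves are real users, which can make the clusters easier to interpret.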


Best Answer

Latent class analysis is one possible approach.

Take the following probability distribution where A, B, and C can take on values of 1 or 0.

$P(A_i, B_j, C_k)$

If these were independent of each other, then we would expect to see:

$P(A_i, B_j, C_k)=P(A_i)P(B_j)P(C_k)$
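A quick way to check this in practice (a hypothetical sketch with simulated data; substitute three columns of your tag matrix) is to compare the observed joint frequencies against the product of the marginals:

```r
# simulate three binary variables that depend on a hidden class X,
# so the independence hypothesis should fail
set.seed(1)
X <- rbinom(1000, 1, 0.5)
A <- rbinom(1000, 1, ifelse(X == 1, 0.8, 0.2))
B <- rbinom(1000, 1, ifelse(X == 1, 0.7, 0.3))
C <- rbinom(1000, 1, ifelse(X == 1, 0.6, 0.4))

# observed joint distribution P(A, B, C)
observed <- table(A, B, C) / 1000

# product of marginals P(A) P(B) P(C)
expected <- outer(outer(table(A) / 1000, table(B) / 1000), table(C) / 1000)

# large discrepancies indicate dependence among the tags
round(observed - array(expected, dim = c(2, 2, 2)), 3)
```

If the discrepancies are near zero, there is no dependency structure for a latent class to explain.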

Once this possibility is eliminated, we might hypothesize that any observed dependency is due to values clustering within otherwise unobserved subgroups. To test this idea, we can estimate the following model:

$P(A_i, B_j, C_k)=\sum_{n} P(X_n)\,P(A_i|X_n)\,P(B_j|X_n)\,P(C_k|X_n)$

where $X$ is a latent categorical variable with $n$ levels, and the variables are conditionally independent within each level. You specify $n$, and the model parameters (marginal probabilities of class membership, and class-specific probabilities for each variable) can be estimated via expectation-maximization.

In practice, you could estimate several models, with $5 \le n \le 10$, and "choose" the best model based on theory, likelihood based fit indices, and classification quality (which can be assessed by calculating posterior probabilities of class membership for the observations).
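As a hedged sketch of that workflow, here is one way to do it in R with the poLCA package (one common choice, not the only one). Note that poLCA expects categories coded starting at 1, so 0/1 tags must be recoded to 1/2; the toy data below is simulated for illustration.

```r
library(poLCA)

# simulated 1/2-coded tags standing in for the real data
set.seed(7)
toy <- as.data.frame(matrix(rbinom(500 * 6, 1, 0.4), nrow = 500)) + 1
names(toy) <- paste0("tag", 1:6)

f <- cbind(tag1, tag2, tag3, tag4, tag5, tag6) ~ 1

# fit a range of class counts (2..4 here; 5..10 on the real data)
# and compare likelihood-based fit indices such as BIC
fits <- lapply(2:4, function(n) poLCA(f, toy, nclass = n, verbose = FALSE))
sapply(fits, function(m) m$bic)

# posterior class-membership probabilities, for assessing
# classification quality
head(fits[[1]]$posterior)
```

Low BIC plus posteriors that are mostly near 0 or 1 would indicate both good fit and clean classification.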

However, trying to identify meaningful patterns in 100 variables with 5-10 groups will likely require reducing that list down prior to estimating the model, which is a tricky enough topic in its own right (REF).
