Solved – Clustering high-dimensional sparse binary data

clustered-standard-errors, clustering, cross-validation, data mining, r

I am trying to cluster Facebook users based on their likes.

I have two problems. First, since Facebook has no dislike button, all I have are likes (1) for some items; for the rest of the items the value is unknown, not necessarily zero (which would correspond to a dislike). If I use 0 for the unknowns, I think my clusters will be biased.
Any suggestions?

Second, suppose I assign 0 to the unknown items and cluster the users with a hierarchical clustering method and a binary distance measure such as Jaccard, Tanimoto,…

How can I evaluate the clustering results? Within- and between-cluster SSE is not appropriate for binary data. If I use median centers, I'm afraid most of them will be zero, since my feature matrix is sparse. So what would be a good way to evaluate the clusters?
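For reference, here is a minimal sketch of the Jaccard distance mentioned in the question, with each user stored as a set of liked item IDs rather than a sparse 0/1 row. The user names and page IDs are made up for illustration. It shows that items neither user liked never enter the calculation, so coding unknowns as 0 affects Jaccard less than it would affect a measure that counts matching zeros.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b|; items liked by neither user are ignored."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two users with no likes at all
    return 1.0 - len(a & b) / len(union)


# Hypothetical users and page IDs, for illustration only.
likes = {
    "u1": {"p1", "p2", "p3"},
    "u2": {"p2", "p3", "p4"},
    "u3": {"p9"},
}

d12 = jaccard_distance(likes["u1"], likes["u2"])  # 2 shared likes out of 4 distinct
d13 = jaccard_distance(likes["u1"], likes["u3"])  # no shared likes
print(d12, d13)
```

Storing likes as sets also avoids materialising the mostly empty feature matrix.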

Best Answer

Consider using a graph-based approach.

Try to find a threshold to define when users are "somewhat similar". It can be quite low. Build a graph of these somewhat similar users.
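The thresholding step could look roughly like the sketch below. The answer does not say which similarity to use; Jaccard similarity is assumed here because the question mentions it. The function name `similarity_graph`, the sample users, and the threshold value are all hypothetical.

```python
from itertools import combinations


def similarity_graph(likes, threshold):
    """Connect two users when the Jaccard similarity of their likes is >= threshold."""
    graph = {user: set() for user in likes}
    for u, v in combinations(likes, 2):
        union = likes[u] | likes[v]
        sim = len(likes[u] & likes[v]) / len(union) if union else 0.0
        if sim >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Hypothetical users and liked items, for illustration only.
likes = {
    "u1": {"a", "b", "c"},
    "u2": {"b", "c", "d"},
    "u3": {"c", "d", "e"},
    "u4": {"x"},
}

graph = similarity_graph(likes, threshold=0.3)
print(graph)
```

This compares every pair of users, which is fine for a sketch but grows quadratically with the number of users. Lowering the threshold adds edges (here, 0.2 would also link u1 and u3).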

Then use a clique detection approach to find groups in this graph.
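One possible way to do the clique step is the Bron-Kerbosch algorithm, sketched below without pivoting. The answer does not name a specific algorithm, so this choice is an assumption, and the small adjacency dictionary is a made-up example of a "somewhat similar" graph.

```python
def maximal_cliques(graph):
    """Yield every maximal clique of an undirected graph (Bron-Kerbosch, no pivoting).

    `graph` maps each node to the set of its neighbours.
    """
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield set(clique)  # nothing left that could extend this clique
            return
        for v in list(candidates):
            yield from expand(clique | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}  # v has been fully explored
            excluded = excluded | {v}

    yield from expand(set(), set(graph), set())


# Hypothetical similarity graph, for illustration only.
graph = {
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3"},
    "u3": {"u1", "u2", "u4"},
    "u4": {"u3"},
    "u5": set(),
}

cliques = list(maximal_cliques(graph))
print(cliques)
```

Note that cliques may overlap (u3 belongs to two groups here), so the result is a set of overlapping groups rather than a partition. Enumerating maximal cliques can take exponential time in the worst case, so for large graphs a library routine such as `networkx.find_cliques` may be a better starting point.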