Solved – Similarity between objects based on tags (binary features)

classificationclusteringcorrelationk-meanssimilarities

I have five millions of objects each of them having one or more tags. How do I compute statistically sound similarity score between each pair of the objects taking into account that:

  1. There are 100 millions of tags most of them used only once.
  2. Some tags are used very often, say one tag may be used on 1% of whole dataset.
  3. Some objects are heavily tagged with hundreds of thousands tags on them while others may be tagged only a couple of times.

Second question: how do I cluster objects taking into account 1-3? My guess is that k-means and other popular clustering techniques won't do much here. I've tried k-means already with simple distance defined as number of similar tags on the objects and clusters are so vague to the point being almost meaningless.

Best Answer

k-means clearly does not make sense, because the mean does not make sense on this kind of data.

However, don't skip the preprocessing. Preprocessing is 90% of your work. E.g. you can probably just drop any unique tag. You can use stemming to merge variations of the same word etc. But in general classic text mining techniques such as TF-IDF should be worth a try.

Now that you can't use k-means doesn't mean that all other clustering algorithms suffer from the same issue. First of all, you can try e.g. DBSCAN and OPTICS.

But maybe you aren't actually looking for clustering algorithms. Have you considered looking at groups defined implicitely e.g. by frequent itemsets? Have you looked at classic document clustering techniques? And how about using the various clustering algorithms for categorical data? There is no reason to use a vector space oriented method such as k-means on a data which naturally is not vectorized. Search on Scholar for "clustering categorical data".

Related Question