Clustering – Clustering a Long List of Strings (Words) into Similarity Groups

clusteringk-meanspattern recognition

I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list, such that similar words, for example words with similar edit (Levenshtein) distance appears in the same cluster. For example "algorithm" and "alogrithm" should have high chances to appear in the same cluster.

I am well aware of the classical unsupervised clustering methods like k-means clustering, EM clustering in the Pattern Recognition literature. The problem here is that these methods work on points which reside in a vector space. I have words of strings at my hand here. It seems that, the question of how to represent strings in a numerical vector space and to calculate "means" of string clusters is not sufficiently answered, according to my survey efforts until now. A naive approach to attack this problem would be to combine k-Means clustering with Levenshtein distance, but the question still remains "How to represent "means" of strings?". There is a weight called as TF-IDF weight, but it seems that it is mostly related to the area of "text document" clustering, not for the clustering of single words. It seems that there are some special string clustering algorithms existing, like the one at http://pike.psu.edu/cleandb06/papers/CameraReady_120.pdf

My search in this area is going on still, but I wanted to get ideas from here as well. What would you do recommend in this case, is anyone aware of any methods for this kind of problem?

Best Answer

Seconding @micans recommendation for Affinity Propagation.

From the paper: L Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." science 315.5814 (2007): 972-976..

It's super easy to use via many packages. It works on anything you can define the pairwise similarity on. Which you can get by multiplying the Levenshtein distance by -1.

I threw together a quick example using the first paragraph of your question as input. In Python 3:

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
    
words = "YOUR WORDS HERE".split(" ") #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Output was (exemplars in italics to the left of the cluster they are exemplar of):

  • have: chances, edit, hand, have, high
  • following: following
  • problem: problem
  • I: I, a, at, etc, in, list, of
  • possibly: possibly
  • cluster: cluster
  • word: For, and, for, long, need, should, very, word, words
  • similar: similar
  • Levenshtein: Levenshtein
  • distance: distance
  • the: that, the, this, to, with
  • same: example, list, names, same, such, surnames
  • algorithm: algorithm, alogrithm
  • appear: appear, appears

Running it on a list of 50 random first names:

  • Diane: Deana, Diane, Dionne, Gerald, Irina, Lisette, Minna, Nicki, Ricki
  • Jani: Clair, Jani, Jason, Jc, Kimi, Lang, Marcus, Maxima, Randi, Raul
  • Verline: Destiny, Kellye, Marylin, Mercedes, Sterling, Verline
  • Glenn: Elenor, Glenn, Gwenda
  • Armandina: Armandina, Augustina
  • Shiela: Ahmed, Estella, Milissa, Shiela, Thresa, Wynell
  • Laureen: Autumn, Haydee, Laureen, Lauren
  • Alberto: Albertha, Alberto, Robert
  • Lore: Ammie, Doreen, Eura, Josef, Lore, Lori, Porter

Looks pretty great to me (that was fun).

Related Question