Solved – Clustering a database of strings based on their similarity to a separate set of words

clustering, dataset, python

I have a list of strings that I extracted from a large database of strings. These "blacklisted" strings have been removed, but I also want nothing similar to them to remain in the database.

Similarity in this case can apply to either:

  1. Spelling mistakes, e.g. if "big butt" is a blacklisted term, remove "bigg but" as well

  2. Similar-looking terms, e.g. if "big butt" is a blacklisted term, remove "big but" as well

I determined that clustering is a good way of grouping strings by similarity, using the Levenshtein distance to express their similarity as a number. The only problem is that the approaches I have seen (such as hierarchical clustering and K-Means; I am new to machine learning) only consider a single dataset.

I have written a Python script that computes the Levenshtein distance between two strings and uses it to cluster a set of data with Affinity Propagation (based on the answer here), but this approach is limited.
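For illustration, here is a minimal sketch of the distance computation in pure Python. It is not the actual script: the function names and sample strings are made up for the example, and the Affinity Propagation step is left out. It also shows that a database string can be scored directly against the blacklist by its distance to the nearest blacklisted term.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein (edit) distance between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def nearest_blacklisted(term, blacklist):
    """Return (closest blacklisted term, distance) for one database string.

    Hypothetical helper, introduced only for this example.
    """
    scored = ((bad, levenshtein(term, bad)) for bad in blacklist)
    return min(scored, key=lambda pair: pair[1])


# Illustrative data only.
blacklist = ["big butt"]
database = ["bigg but", "big but", "large dataset"]

for term in database:
    print(term, nearest_blacklisted(term, blacklist))
```

With this definition, "big but" is at distance 1 from "big butt" and "bigg but" is at distance 2, so any removal rule would have to pick a distance threshold.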

Is there a machine-learning approach that can extract the data I want? Is it possible to cluster one dataset based on a different dataset?

Best Answer

Sorry, this will not work.

"fog" and "fag" are very similar and have a Levenshtein distance of 1.

"duck" is similar to "d-ck", "clock" to "c-ck", and "butt" to "but".

So if you blacklist "fag", your approach would also prevent "fog".

Likewise, a harmless sentence such as "My data is big but the program does not work." would be caught if "big butt" is blacklisted.

Don't do this. If you want spelling correction, use a dictionary-based spell checker with care.
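As a rough sketch of what "with care" could mean, the example below compares a naive distance threshold with a version that first checks a dictionary and leaves correctly spelled words alone. Everything here is an assumption made for illustration: the tiny `DICTIONARY` set stands in for a real word list, `flagged_tokens` is a hypothetical helper, and matching is done per word rather than per phrase.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein (edit) distance between a and b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# Hypothetical stand-ins; a real system would load a full word list.
DICTIONARY = {"my", "data", "is", "big", "but", "the",
              "program", "does", "not", "work", "fog"}
BLACKLIST = {"butt"}


def flagged_tokens(text, max_distance=1, use_dictionary=True):
    """Return the words in text that lie within max_distance of a blacklisted term."""
    flagged = []
    for token in text.lower().split():
        word = token.strip(".,!?\"'")
        if use_dictionary and word in DICTIONARY and word not in BLACKLIST:
            continue  # a correctly spelled, legitimate word: leave it alone
        if any(levenshtein(word, bad) <= max_distance for bad in BLACKLIST):
            flagged.append(word)
    return flagged


sentence = "My data is big but the program does not work."
print(flagged_tokens(sentence, use_dictionary=False))  # ['but'] (false positive)
print(flagged_tokens(sentence, use_dictionary=True))   # []
print(flagged_tokens("nice buttt"))                    # ['buttt']
```

The dictionary check only helps when the word list is complete enough; any legitimate word missing from it is still exposed to the same false positives.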