I have a list of strings that I extracted from a large database of strings. These "blacklisted" strings have been removed, but I also want to remove anything similar to them from the database.
Similarity in this case can apply to either:
- Spelling mistakes, e.g. if "big butt" is a blacklisted term, remove "bigg but" as well
- Similar-looking terms, e.g. if "big butt" is a blacklisted term, remove "big but" as well
I determined that clustering is a good way of grouping strings by similarity, using Levenshtein distance to express their similarity as a number. The only problem is that the approaches I have seen (such as hierarchical clustering and K-Means; I am new to machine learning) consider only a single dataset.
I have written a Python script that computes the Levenshtein distance between two strings and uses it to cluster a set of data with Affinity Propagation (based on the answer here), but this approach is limited.
Is there a machine-learning approach that can extract the data I want? Is it possible to cluster one dataset based on a different dataset?
Best Answer
Sorry, this will not work.
"fog" and "fag" are very similar, and have a Levenshtein distance of 1.
"duck" is similar to "d-ck", "clock" to "c-ck", "butt" to "but".
So if you blacklist "fag", your approach would also prevent "fog".
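To make the false-positive problem concrete, here is a minimal sketch of a plain dynamic-programming Levenshtein distance, applied to the "butt"/"but" pair. The function names `levenshtein` and `flagged` and the `max_distance` threshold are my own illustrative choices, not anything from the asker's script.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


def flagged(term: str, blacklist, max_distance: int = 1) -> bool:
    """Naive filter (hypothetical): flag anything within max_distance of a blacklisted term."""
    return any(levenshtein(term, bad) <= max_distance for bad in blacklist)


# A harmless, very common word gets caught by the naive filter:
print(levenshtein("butt", "but"))   # 1
print(flagged("but", ["butt"]))     # True
```

Raising the threshold to catch more misspellings only makes this worse, because more legitimate words fall inside the radius.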
Don't do this. If you want spelling correction, use a dictionary-based spell checker with care.
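One way to read "with care" is sketched below, under several assumptions: a term is only treated as a misspelling if it is not itself a valid dictionary word, and only removed if its best correction is blacklisted. The tiny `DICTIONARY` set, the `should_remove` helper and the `cutoff` value are hypothetical, the sketch handles single words rather than phrases, and `difflib.get_close_matches` from the standard library stands in for a real spell checker.

```python
import difflib

# Hypothetical stand-ins: a real setup would load a full word list.
DICTIONARY = {"big", "but", "butt", "fog", "duck", "clock"}
BLACKLIST = {"butt"}


def should_remove(word: str, blacklist=BLACKLIST, dictionary=DICTIONARY,
                  cutoff: float = 0.8) -> bool:
    """Remove exact blacklist hits and likely misspellings of them,
    but never a word that is itself a valid dictionary entry."""
    w = word.lower()
    if w in blacklist:
        return True
    if w in dictionary:
        return False  # a real word such as "but" is left alone
    # Unknown word: find the single closest dictionary entry, if any.
    suggestions = difflib.get_close_matches(w, sorted(dictionary), n=1, cutoff=cutoff)
    return bool(suggestions) and suggestions[0] in blacklist


for candidate in ["butt", "but", "buttt", "fog"]:
    print(candidate, should_remove(candidate))
```

This still makes mistakes (an unknown word can be closest to the wrong entry), so the output is probably better used as a review queue than as an automatic delete list.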