Solved – Normalizing edit distance on strings

clusteringdistancedistance-functionstext mining

I am going to run a clustering algorithm on strings (sequences of characters). I would like to use the edit distance, but it seems to be misleading as I perceive Anna to be closer to Anne than A to B. Both have edit distance 2.

My idea would be to normalize the edit distance by sum of lengths of both strings. In that case, we would have $2/8$ for Anna to Anne; and for A to B we would have $2/2$.

On the other hand A to B seems to be closer Adam to Paris. Is there any class of distance metrics that could manage the trade-off. Namely

d('Anna','Anne') < d('A','B') <= d('Adam','Paris')

Best Answer

A fairly common "normalized" Levenshtein version that works just like that exists.

I don't know any reference. Dividing by the length is probably so obvious that nobody considers this worth publishing. It's simply referred to as "normalized Levenshtein distance".