I am going to run a clustering algorithm on strings (sequences of characters). I would like to use the edit distance, but it seems to be misleading as I perceive Anna
to be closer to Anne
than A
to B
. Both have edit distance 2.
My idea would be to normalize the edit distance by sum of lengths of both strings. In that case, we would have $2/8$ for Anna
to Anne
; and for A
to B
we would have $2/2$.
On the other hand A
to B
seems to be closer Adam
to Paris
. Is there any class of distance metrics that could manage the trade-off. Namely
d('Anna','Anne') < d('A','B') <= d('Adam','Paris')
Best Answer
A fairly common "normalized" Levenshtein version that works just like that exists.
I don't know any reference. Dividing by the length is probably so obvious that nobody considers this worth publishing. It's simply referred to as "normalized Levenshtein distance".