Solved – Normalizing edit distance on strings

clustering, distance, distance-functions, text mining

I am going to run a clustering algorithm on strings (sequences of characters). I would like to use the edit distance, but it seems to be misleading as I perceive Anna to be closer to Anne than A to B. Both have edit distance 2.

My idea would be to normalize the edit distance by the sum of the lengths of both strings. In that case, we would have $2/8$ for Anna to Anne; and for A to B we would have $2/2$.
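As a rough check of these numbers, here is a minimal Python sketch. It assumes the edit distance in question allows only insertions and deletions (so replacing one character costs 2), since that is the variant under which both pairs are at distance 2. The names `indel_distance` and `normalized_by_sum` are made up for illustration.

```python
def indel_distance(a: str, b: str) -> int:
    """Edit distance using insertions and deletions only (assumed variant)."""
    # prev[j] is the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])            # match, no cost
            else:
                curr.append(1 + min(prev[j],        # delete ca
                                    curr[j - 1]))   # insert cb
        prev = curr
    return prev[-1]


def normalized_by_sum(a: str, b: str) -> float:
    """Divide the distance by the sum of both lengths, as proposed above."""
    total = len(a) + len(b)
    return indel_distance(a, b) / total if total else 0.0


for pair in [("Anna", "Anne"), ("A", "B"), ("Adam", "Paris")]:
    print(pair, indel_distance(*pair), normalized_by_sum(*pair))
```

Under these assumptions Anna/Anne scores 2/8 and A/B scores 2/2, as stated. Adam/Paris comes out at 7/9, which is lower than the score for A/B.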

On the other hand, A to B seems to be closer than Adam to Paris. Is there any class of distance metrics that could manage this trade-off? Namely:

d('Anna','Anne') < d('A','B') <= d('Adam','Paris')

Best Answer

A fairly common "normalized" version of the Levenshtein distance works just like that.

I don't know of any reference. Dividing by the length is probably so obvious that nobody considers it worth publishing. It's simply referred to as "normalized Levenshtein distance".
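The answer does not say which length to divide by, so the following is only a sketch under one assumption: standard Levenshtein distance (insert, delete, substitute, each at cost 1) divided by the length of the longer string. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Assumed normalization: divide by the longer length, giving [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


d_names = normalized_levenshtein("Anna", "Anne")
d_letters = normalized_levenshtein("A", "B")
d_words = normalized_levenshtein("Adam", "Paris")
print(d_names, d_letters, d_words)
print(d_names < d_letters <= d_words)
```

With this particular normalization the ordering asked for in the question holds, although A/B and Adam/Paris tie at the maximum value of 1, because neither pair can be aligned more cheaply than by rewriting the whole string.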

Related Solutions

Solved – Log-likelihood distance measure validity for clustering

What definition of log-likelihood is that? I've seen $$r(a,b) = \log \frac{P(a|Mod)}{P(b|Mod)} = \log(P(a|Mod)) - \log(P(b|Mod)) ,$$ but here you're subtracting your two probabilities.

Solved – Using k-means with other metrics

It's not as if k-means will necessarily blow up and fail if you use a different metric.

In many cases it will return some result. It is just not guaranteed that it finds the optimum centroids or partitions with other metrics, because the mean may not be suitable for minimizing distances.

Consider the Earth mover's distance (EMD). Given the three vectors

3 0 0 0 0
0 0 3 0 0
0 0 0 0 3

The arithmetic mean is

1 0 1 0 1

which has EMD distances 6, 4, 6 (total 16). If the algorithm had instead used

0 0 3 0 0

the EMD distances would have been 6, 0, 6; i.e. better (total 12).

The arithmetic mean does not minimize EMD, so k-means (with the arithmetic mean) will not yield optimal representatives.
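The totals above can be reproduced with a short Python sketch. It assumes the vectors are histograms of equal total mass over bins spaced one unit apart, in which case the one-dimensional EMD equals the sum of absolute cumulative differences. The helper name `emd_1d` is made up for this example.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two equal-mass histograms on unit-spaced bins (assumed setup)."""
    # The running sum of differences is the mass that must cross each bin boundary.
    diffs = (x - y for x, y in zip(p, q))
    return sum(abs(c) for c in accumulate(diffs))


vectors = [(3, 0, 0, 0, 0), (0, 0, 3, 0, 0), (0, 0, 0, 0, 3)]

# Component-wise arithmetic mean, as k-means would compute it.
mean = tuple(sum(col) / len(vectors) for col in zip(*vectors))
alternative = (0, 0, 3, 0, 0)

to_mean = [emd_1d(v, mean) for v in vectors]
to_alternative = [emd_1d(v, alternative) for v in vectors]

print(mean, to_mean, sum(to_mean))
print(alternative, to_alternative, sum(to_alternative))
```

The first line of output gives the distances to the arithmetic mean (total 16), the second the distances to the alternative representative (total 12).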

Similar things will hold for edit distances.
