Solved – What metric can I use to calculate the distance between labels

Let's say we have a set of labels of the same length, and we need to find the distance between them.

In the case of binary labels, one can use the Hamming distance. For example, if $l_1 = 01101$ and $l_2 = 00111$, then $d(l_1, l_2) = 2$.
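As a minimal sketch of the binary case above, the Hamming distance just counts the positions where two equal-length labels differ. The function name `hamming` is chosen here for illustration.

```python
def hamming(u: str, v: str) -> int:
    """Count the positions at which two equal-length labels differ."""
    if len(u) != len(v):
        raise ValueError("labels must have the same length")
    return sum(a != b for a, b in zip(u, v))


print(hamming("01101", "00111"))  # 2
```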

In my case, labels are formed from the alphabet $A=\{a, b, c, d, e\}$, so the length of the alphabet is $|A|=5$, and the length of each label is $n=4$.

The letters of the alphabet $A$ are ordered on an ordinal scale:

$$a < b < c < d < e.$$

Examples of labels: deed, aaaa, aaad, aaae, dada, cccd.

Edit. The Hamming distance for the three labels aaaa, aaad, aaae gives $$d(aaaa, aaad) = d(aaaa, aaae)$$ but I am looking for a metric that distinguishes $d$ from $e$ and returns $$d(aaaa, aaad)<d(aaaa, aaae)$$ because $d<e$.
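One simple candidate with the requested property, offered here only as a sketch and not taken from the post, is to map each letter to its rank and sum the absolute rank differences (an L1 distance on rank vectors). The `RANK` table assumes equal spacing between neighbouring letters, which an ordinal scale alone does not guarantee.

```python
# Assumption: letters are equally spaced, a=0, b=1, ..., e=4.
RANK = {letter: i for i, letter in enumerate("abcde")}


def ordinal_l1(u: str, v: str) -> int:
    """Sum of absolute rank differences between two equal-length labels."""
    if len(u) != len(v):
        raise ValueError("labels must have the same length")
    return sum(abs(RANK[a] - RANK[b]) for a, b in zip(u, v))


print(ordinal_l1("aaaa", "aaad"))  # 3
print(ordinal_l1("aaaa", "aaae"))  # 4
```

Unlike the Hamming distance, this separates aaad from aaae, because the size of each letter change matters and not only whether a change occurred.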

Edit 2.

To create a label, we use a threshold $T \in \mathbf{R}$ and apply the following function to the $i$-th element of $X=(x_1, x_2, \ldots, x_n)$:
\begin{equation}
f(x_i) =
\begin{cases}
a, & x_i \leq -T, \\
b, & -T < x_i < 0, \\
c, & x_i = 0, \\
d, & 0 < x_i \leq T, \\
e, & x_i > T.
\end{cases}
\end{equation}
Finally, we join the letters with the concatenation operator $\&$; for example, $a \& a \& a \& a= aaaa$.
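The labeling procedure can be sketched in a few lines. This assumes $T > 0$ and reads the $b$ case as strictly below zero, so that $x_i = 0$ maps to $c$; the names `letter` and `make_label` are illustrative.

```python
def letter(x: float, T: float) -> str:
    """Map one value to a letter using the threshold T (assumed positive)."""
    if x <= -T:
        return "a"
    if x < 0:
        return "b"
    if x == 0:
        return "c"
    if x <= T:
        return "d"
    return "e"


def make_label(xs, T: float) -> str:
    """Concatenate the letters of all elements into one label."""
    return "".join(letter(x, T) for x in xs)


print(make_label([-2.0, -0.5, 0.0, 0.5], T=1.0))  # abcd
```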

Question. What metric can I use to calculate the distance between labels?

Best Answer

It really depends on what kind of words you are referring to. There are two distances I want to discuss:

Edit Distance

If you want to capture how different two sequences are, you can use the Levenshtein distance or the Damerau-Levenshtein distance. For two words $A$ and $B$, the Levenshtein distance is the minimum number of single-character operations (insertions, deletions, or substitutions) needed to transform $A$ into $B$. This may be what you are looking for if you define a word as a sequence of letters from an alphabet.
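
A minimal sketch of the Levenshtein distance, using the standard row-by-row dynamic program. Note that for the labels in the question it treats every substitution as cost 1, so on its own it does not reflect the ordering of the letters.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(levenshtein("deed", "deer"))  # 1
```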

Context Similarity

For words, we can also talk about contextual meaning. If two words are related or have similar meanings, we expect this distance to be small. This can be implemented with word2vec: we train a model in an unsupervised manner to obtain a vector representation of each word, and we measure the distance by comparing the two vectors. The most popular way to compare them is cosine similarity.
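As a small illustration of the comparison step only, here is cosine similarity on plain Python lists. The vectors are made up for the example and do not come from a trained word2vec model.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, for illustration only.
vec_x = [0.9, 0.1, 0.3]
vec_y = [0.8, 0.2, 0.4]
similarity = cosine_similarity(vec_x, vec_y)
distance = 1.0 - similarity  # a common way to turn similarity into a distance
print(similarity, distance)
```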

The two distances do not necessarily agree. For example, for deed and deer the edit distance is small (in fact, it equals 1), but the contextual distance will be large because the words are unrelated.

Edit: the asker has since explained the specific case.

You can consider using Earth Mover's/Wasserstein distance.

This is my idea of how you might approach it. Suppose you want to impose the ordering $a < b < c < d < e$ on the letters and your words have 3 letters. Take the word $abc$: let $t_1=0,t_2=1,t_3=2$ be the letter positions and $w_1=a,w_2=b,w_3=c$ the letters. Also let $T$ (not the threshold $T$ from the question) be a key-value mapping from each letter to some numeric value, chosen to reflect your ordering.

\begin{equation}
f(x) =
\begin{cases}
T(w_i), & t_i \leq x < t_{i+1}, \\
0, & \text{otherwise}.
\end{cases}
\end{equation}
Forgive my loose notation, but the idea is this: if we let $T(a)=0.8$ and $T(b)=1.5$, for example, then $f(x)=0.8$ for $x\in[0,1)$ and $f(x)=1.5$ for $x\in[1,2)$. You can now treat $f$ as an unnormalized distribution and compute the earth mover's/Wasserstein distance between the functions of two words. This is just a rough idea and might not necessarily make sense.
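A rough sketch of this idea, under several assumptions that are not in the answer: the mapping `VALUE` is hypothetical, each word's heights are normalized to sum to 1, and each letter's mass is treated as concentrated at its position $t_i$ (a simplification of the step function). For two distributions on the same unit-spaced positions, the 1-D Wasserstein distance is then the sum of absolute differences between their cumulative sums.

```python
from itertools import accumulate

# Hypothetical letter values; any increasing positive values reflect a < b < c < d < e.
VALUE = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}


def profile(word: str) -> list:
    """Normalize the letter heights of a word into a probability distribution."""
    heights = [VALUE[ch] for ch in word]
    total = sum(heights)
    return [h / total for h in heights]


def wasserstein_1d(u: str, v: str) -> float:
    """1-D Wasserstein distance between the normalized profiles of two labels."""
    if len(u) != len(v):
        raise ValueError("labels must have the same length")
    cdf_u = accumulate(profile(u))
    cdf_v = accumulate(profile(v))
    return sum(abs(a - b) for a, b in zip(cdf_u, cdf_v))


print(wasserstein_1d("aaaa", "aaad"))
print(wasserstein_1d("aaaa", "aaae"))
```

With these values, aaad comes out closer to aaaa than aaae does. One caveat of the normalization step: words whose heights differ only by a constant factor, such as aaaa and eeee, get the same profile and therefore distance 0, so this variant may not suit every use.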

Here is a useful link.
