Solved – Clustering short strings to find similar strings within a dataset

I have a large dataset (ranging from 10k to 100k items) and some of them are almost duplicates.

What I'd like to do is to merge those almost duplicates by finding clusters of words in my dataset.

I've been browsing some answered questions and they provide a lot of infos that I'm finding hard to process, especially because the answers depend on problem's domain.

Let's say that a snippet of my dataset is the following (pairs are term,termFrequency):

forum, 1349.0
roman, 728.0
roman forum, 590.0
temple, 512.0
foro, 336.0
romanum, 300.0
forum romanum, 276.0
fori, 207.0
romulus, 206.0
faustina, 178.0
imperiali, 174.0
italyrome, 170.0
fori imperiali, 166.0
maxentius, 159.0
ancient, 137.0
church, 131.0
ruins, 131.0
antoninus, 126.0
damiano, 119.0
cosma, 115.0
arch, 115.0
hill, 115.0
santi, 114.0
santa, 104.0

What I'd expect is to have these clusters:

[forum, roman, roman forum, foro, romanum, forum romanum]
[temple]
[romulus]
[faustina]
[imperiali, fori, fori imperiali]
[italyrome]
[maxentius]
[ancient]
[church]
[ruins]
[antoninus]
[damiano]
[cosma]
[arch]
[hill]
[santi, santa]

I've already tried both DBSCAN and OpticsXI using Levensthein distance as metric, but this metric is not able to produce the first cluster so I can't use it.

The question is: is there something that I can use in Java to achieve the results above?

Best Answer

Problem:

The clusters that you expect will not result from only using the Levenshtein Distance, since

LD(fori , fori imperiali) = 10 and LD (fori, ruins) = 5

Therefore "fori" will more likely be in a cluster with "ruins" than with "fori imperiali".

Solution:

You could adjust your distance measure such as:

LD_adjusted (a, b) = LD (a, b) * ((a not in b) & (b not in a));

where (a not in b) is a Boolean variable which is 1 if true and 0 if false.

This would yield the normal Levenshtein distance for all pairs of a,b where a is not in b and b is not in a. Yet if (a in b) or (b in a), it would always yield 0, just as you require.

LD_adjusted (fori, fori imperiali) = 10 * (0 & 1) = 10 * 0 = 0

LD_adjusted (fori, ruins) = 5 * (1 & 1) = 5 * 1 = 5

With this adjusted distance measure, you should reach the result that you are asking for. Please provide feedback if this answer did help you.

Sources: I used this online caluculator for levensthein distance.

Best Answer

Related Solutions

Solved – How to compute term frequency and find clusters in a dataset composed of strings

Solved – Clustering short time series

Related Question