[Math] Grouping numbers that are “close to each other”

algorithms

Let's say you have a set of numbers S and you want to obtain subsets of $S$, $s_1,s_2,\ldots, s_i$, where $i < N$.

Is there a particular operation that will group the numbers that are "close to each other"?

I will give an example to clarify the question:

Let's say you have a set $S$ which contains $\{1.4, 2, 2.5, 2.7, 14, 16, 49, 57, 58\}$

And let's say that you can have maximum of $N=4$ subsets.

So, you might end up with a result that looks like this:

$s_1 = \{1.4, 2, 2.5, 2.7\}$
$s_2 = \{14, 16\}$
$s_3 = \{49\}$
$s_4 = \{57, 58\}$

I am looking for either the name what such a problem would be called (which I can then use to research and write an algorithm). If you can provide a simple solution, even better.

Thank you for any help with this.

Best Answer

There are many solutions to this problem extant, each rediscovered many times. Search "statistical clustering methods" or "cluster analysis." A particularly nice one for your problem is called "K-means". This was rediscovered by cartographers (!) where it is called "Jenks' method" (and is worth mentioning because it's used by some popular mapping software, so perhaps more people have heard of this than of any other method). The idea behind K-means is to minimize the population-weighted sum of variances of the subsets.

BTW, the challenge in most clustering algorithms lies in determining the number of clusters; usually that needs to be specified (as $N = 4$ in this example) or at least hinted at by the user. For instance, there always exists a minimum K-means solution: partition $S$ into singletons. If you require $i < n$, merge the two nearest neighbors but leave all the other singletons as they are: the sum of variances is then just one-quarter the square of the smallest difference among the $x_i$.

Related Question