[Math] Distance between two ranges

clusteringeuclidean-geometrygeometry

I'm working on a clustering algorithm to group similar objects that are represented by ranges of real numbers. Let's say that I have a group of people who are buying sugar. Each of them defines minimum and maximum amount he would be willing to buy. What I would like to do is to put in one cluster those people that want to buy similar amount. For example, one of the result clusters would be for people who need between 3 and 4 kg of sugar, while the other one might be between 6 and 8 kg. For this, I have to create a distance function which would tell how similar the requirements of the buyers are, i.e., what is the distance (or difference) between two ranges. What would be the best way to do this?

Best Answer

One way to do this would be to consider the Hausdorff distance. Given two subsets $A$, $B$ of a metric space $X$ (for example, two intervals in $\mathbb{R}$) the Hausdorff distance between them is the largest value out of

$$\sup_{a\in A}\, \inf_{b\in B}\, d(a,b)$$

$$\sup_{b\in B}\, \inf_{a\in A}\, d(a,b)$$

That is, first we consider a point $a \in A$, find the least distance to a point $b\in B$, and maximise this over $A$. Then we do the same thing with the roles of $A$ and $B$ reversed. Finally we take the largest of these two values.

Under the Hausdorff metric, the set of non-empty compact subsets of $X$ into a metric space (so, for example, all distances are non-negative, if the distance is zero then the sets coincide, and distances satisfy the triangle inequality).


How does this work in your case? Let's write $A=[a_1,a_2]$ and $B=[b_1,b_2]$ and take a special case of $a_1<a_2<b_1<b_2$, so that the intervals are non-overlapping and every $a\in A$ is less than every $b\in B$. Then the first expression has the value $b_1-a_1$ and the second has the value $b_2-a_2$, so the distance between the intervals is

$$d(A,B)=\max(b_1-a_1, b_2-a_2)$$

How about if $B$ is contained inside $A$, so we have $a_1<b_1<b_2<a_2$? Then the first quantity is $\max(b_1-a_1, a_2-b_2)$ and the second quantity is $0$, so the distance between the intervals is

$$d(A,B)=\max(b_1-a_1, a_2-b_2)$$

You can proceed through all possible cases in this way. In fact, my conjecture is that for any two such intervals, we always have

$$d(A,B) = \max(|a_1-b_1|, |a_2-b_2|)$$

although my enthusiasm doesn't quite stretch to working through all the cases. Shouldn't be too hard to do it yourself though!


More information: http://en.wikipedia.org/wiki/Hausdorff_distance

Related Question