I'm working on a clustering algorithm to group similar objects that are represented by ranges of real numbers. Let's say I have a group of people who are buying sugar. Each of them defines the minimum and maximum amount they would be willing to buy. What I would like to do is put into one cluster those people who want to buy a similar amount. For example, one of the resulting clusters might be for people who need between 3 and 4 kg of sugar, while another might be for people who need between 6 and 8 kg. For this, I have to create a distance function which would tell how similar the requirements of two buyers are, i.e., what the distance (or difference) between two ranges is. What would be the best way to do this?
[Math] Distance between two ranges
clustering, euclidean-geometry, geometry
Related Solutions
Define a distance function between data points and this becomes easier. Let $F_x(i)$ be the $i$th numerical feature and $D_x(i)$ be the $i$th nominal feature (as a one-hot vector) of data point $x$. Then the distance between data points $x$ and $y$ can be, for instance, $$ \delta(x,y) = \sum_i \gamma_i | F_x(i) - F_y(i) |^p + \eta\sum_j || D_x(j) - D_y(j) ||_2^2 $$ where we can choose $p,\gamma_i,\eta$ based on the data itself. For instance, we can choose $p=1$, $\eta=1/|D|$ as one over the number of nominal features, and $$ \gamma_i = \frac{1}{|F|}\left(\frac{1}{|X|}\sum_{x\in X} F_x(i)\right)^{-1} $$ as the weight for numerical feature $i$, for the dataset $X$: scaling each numerical difference by the inverse of that feature's mean keeps the relative contribution of each term similar in magnitude.
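As a concrete sketch, $\delta$ might look like the following (the function and argument names are illustrative, not from any particular library; `gamma` is taken as a precomputed weight vector):

```python
import numpy as np

def delta(Fx, Fy, Dx, Dy, gamma, p=1, eta=None):
    """Distance between two data points: weighted p-th-power differences of
    numerical features plus squared Euclidean distances of one-hot nominal
    features. All names here are hypothetical."""
    if eta is None:
        eta = 1.0 / len(Dx)  # eta = 1/|D|, one over the number of nominal features
    numerical = sum(g * abs(fx - fy) ** p for g, fx, fy in zip(gamma, Fx, Fy))
    nominal = eta * sum(float(np.sum((dx - dy) ** 2)) for dx, dy in zip(Dx, Dy))
    return numerical + nominal
```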
Then, given two clusters $C_1$ and $C_2$, there are many ways to compute normalized similarity. One is just $$ S(C_1,C_2) = \frac{1}{1+\Delta(C_1,C_2)},\;\;\text{where}\;\; \Delta(C_1,C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x\in C_1} \sum_{y\in C_2} \delta(x,y) $$ so that we get a similarity of $1$ when the average pairwise distance is $0$ and something close to $0$ when the clusters are very different. Another, for instance, is $S_e(C_1,C_2)=\exp(-\Delta(C_1,C_2))$.
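A minimal sketch of this average-pairwise-distance similarity, assuming some point-to-point distance `delta` is passed in:

```python
def cluster_similarity(C1, C2, delta):
    """S(C1, C2) = 1 / (1 + Delta), where Delta is the average pairwise
    distance between points of the two clusters."""
    Delta = sum(delta(x, y) for x in C1 for y in C2) / (len(C1) * len(C2))
    return 1.0 / (1.0 + Delta)
```

With one-dimensional data and `delta = lambda x, y: abs(x - y)`, distant clusters give a similarity near $0$; replacing the return value with `math.exp(-Delta)` gives the $S_e$ variant.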
Alternatively, we could replace each $D_x(\ell)$ with a one-hot vector, and "unfold" each data point into a vector of numbers $\vec{x}$. Then we could compute a similarity via $$ \tau_c(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{||\vec{x}||_2\,||\vec{y}||_2} $$ which measures the angle between the unitized vectors in the data space. This is the cosine similarity, so $\tau_c\in[-1,1]$. (Note that no attempt is made to account for the magnitude similarities across dimensions.) Then we can measure overall similarity via $$ S_c(C_1,C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x\in C_1} \sum_{y\in C_2} \frac{\tau_c(\vec{x},\vec{y}) + 1}{2} $$ which is $0$ for very different clusters and $1$ for very close ones.
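The cosine variant, sketched the same way (unfolded data points are plain NumPy arrays; the names are illustrative):

```python
import numpy as np

def tau_c(x, y):
    """Cosine similarity between two unfolded data vectors, in [-1, 1]."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def S_c(C1, C2):
    """Average of (tau_c + 1)/2 over all cross-cluster pairs, in [0, 1]."""
    total = sum((tau_c(x, y) + 1.0) / 2.0 for x in C1 for y in C2)
    return total / (len(C1) * len(C2))
```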
These two clustering rules do not necessarily produce the same results
Suppose you have the following four points and you want $k=2$ clusters:
- $A: (33,0)$
- $B: (-33,0)$
- $C: (0,56)$
- $D: (0,124)$
What would the two clusters be?
With method $1$ you would cluster $\{A,B\}$ with a circle of diameter $66$ centred at $(0,0)$ and $\{C,D\}$ with a circle of diameter $68$ centred at $(0,90)$, making the maximum diameter $68$. You are not going to do better than this: consider, for example, circles centred at the point $(0,18)$, near the centre of the smallest circle containing $A$, $B$, $C$. That centre is more than $35$ away from all four points, so any circle centred there containing at least one of them has a diameter over $70$.
With method $2$ you would cluster $\{A,B,C\}$, since they are all $65$ or $66$ away from each other, leaving $\{D\}$ for the other cluster, since $D$ is at least $68$ away from any of the others.
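The pairwise distances quoted above are easy to check numerically:

```python
from itertools import combinations
from math import dist  # Python 3.8+

points = {"A": (33, 0), "B": (-33, 0), "C": (0, 56), "D": (0, 124)}
pairwise = {frozenset(pq): dist(points[pq[0]], points[pq[1]])
            for pq in combinations(points, 2)}
# |AB| = 66, |AC| = |BC| = 65, |CD| = 68, |AD| = |BD| > 128
```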
Best Answer
One way to do this would be to consider the Hausdorff distance. Given two subsets $A$, $B$ of a metric space $X$ (for example, two intervals in $\mathbb{R}$) the Hausdorff distance between them is the largest value out of
$$\sup_{a\in A}\, \inf_{b\in B}\, d(a,b)$$
$$\sup_{b\in B}\, \inf_{a\in A}\, d(a,b)$$
That is, first we consider a point $a \in A$, find the least distance to a point $b\in B$, and maximise this over $A$. Then we do the same thing with the roles of $A$ and $B$ reversed. Finally we take the larger of the two values.
Under the Hausdorff metric, the set of non-empty compact subsets of $X$ becomes a metric space in its own right (so, for example, all distances are non-negative, the distance between two sets is zero if and only if they coincide, and distances satisfy the triangle inequality).
How does this work in your case? Let's write $A=[a_1,a_2]$ and $B=[b_1,b_2]$ and take a special case of $a_1<a_2<b_1<b_2$, so that the intervals are non-overlapping and every $a\in A$ is less than every $b\in B$. Then the first expression has the value $b_1-a_1$ and the second has the value $b_2-a_2$, so the distance between the intervals is
$$d(A,B)=\max(b_1-a_1, b_2-a_2)$$
How about if $B$ is contained inside $A$, so we have $a_1<b_1<b_2<a_2$? Then the first quantity is $\max(b_1-a_1, a_2-b_2)$ and the second quantity is $0$, so the distance between the intervals is
$$d(A,B)=\max(b_1-a_1, a_2-b_2)$$
You can proceed through all possible cases in this way. In fact, my conjecture is that for any two such intervals, we always have
$$d(A,B) = \max(|a_1-b_1|, |a_2-b_2|)$$
although my enthusiasm doesn't quite stretch to working through all the cases. Shouldn't be too hard to do it yourself though!
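For intervals the cases can also be checked by machine. Since $\inf_{b\in B} d(a,b) = \max(b_1-a,\,a-b_2,\,0)$ is convex in $a$, each supremum is attained at an endpoint, so the Hausdorff distance between two intervals is computable exactly from the four endpoints. A randomized check of the conjectured closed form (a sketch, not from any library):

```python
import random

def point_to_interval(x, lo, hi):
    """Distance from point x to the closed interval [lo, hi]."""
    return max(lo - x, x - hi, 0.0)

def hausdorff(A, B):
    """Hausdorff distance between closed intervals A = (a1, a2), B = (b1, b2).
    Each sup is attained at an endpoint because point-to-interval distance
    is convex in the point."""
    (a1, a2), (b1, b2) = A, B
    d1 = max(point_to_interval(a1, b1, b2), point_to_interval(a2, b1, b2))
    d2 = max(point_to_interval(b1, a1, a2), point_to_interval(b2, a1, a2))
    return max(d1, d2)

# Randomized check of the conjecture d(A,B) = max(|a1-b1|, |a2-b2|)
random.seed(0)
for _ in range(10_000):
    a1, a2 = sorted(random.uniform(-10, 10) for _ in range(2))
    b1, b2 = sorted(random.uniform(-10, 10) for _ in range(2))
    assert abs(hausdorff((a1, a2), (b1, b2))
               - max(abs(a1 - b1), abs(a2 - b2))) < 1e-12
```

On the sugar example from the question, the $[3,4]$ kg and $[6,8]$ kg ranges come out $\max(3,4)=4$ apart.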
More information: http://en.wikipedia.org/wiki/Hausdorff_distance