[Math] How to compute numerically the Earth mover’s distance (EMD)

probabilityprobability distributionsprobability theory

I was trying to compute numerically (write a program) that calculated the EMD for two probability distribution $p_X$ and $q_X$. However, I had a hard time finding an outline of how to exactly compute such a distance.

I was wondering if there existed a closed form equation for EMD or if there existed an outline of an algorithm to compute it.

Just like there is very compact equation for say, the KL-divergence $D(p_X||q_X) = \sum_{x \in X}p_X(x) log\frac{p_X(x)}{q_X(x)}$

Is there one such equation for EMD or an algorithm for EMD such that it can be easily numerically computed?

What I have found so far is the following, that $EMD(p_X,q_X)$ between distributions $p_X$ and $q_X$ is:

$$EMD(P,Q) = \frac{\sum^m_{i=1}\sum^n_{j=1} f_{ij}d_{ij}}{\sum^m_{i=1}\sum^b_{j=1} f_{ij}}$$

However, to find $d_{ij}$ one needs to define some distance measure I guess and to find $f_{ij}$ one needs to solve the transportation linear programming defined in:

http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/RUBNER/emd.htm#RUBNER98A

Isn't there a solution to the transportation problem such that there is an easy way of evaluating EMD?

Also, I found a wikipedia section to compute it but, the pseudo-code was to ambiguous for me to understand how to actually compute the EMD. If someone understands that section better and can explain it, it would awesome!

Here is the link

http://en.wikipedia.org/wiki/Earth_mover's_distance

or you can just see the pseudo-code right here:

The Pseudo-code/wikipedia section:

If the domain D is discrete, the EMD can be computed by solving an instance transportation problem, which can be solved by the so-called Hungarian algorithm. In particular, if D is a one-dimensional array of "bins" the EMD can be efficiently computed by scanning the array and keeping track of how much dirt needs to be transported between consecutive bins. For example:

$EMD_0 = 0 \\$

$EMD_{i+1} = ( A_i + EMD_i ) – B_i \\$

$TotalDistance = \sum | EMD_i | \\$

I guess what I don't understand is what $A_i$ and $B_i$ are and when the algorithm even stops running. Its just to unclear to me what its doing and I guess I don't understand EMD well enough to derive it myself (If I could I would!)


As an extension to my question, is it a problem if the sample space for the probability distributions is infinite?


I probably won't accept answer that are not as general as possible, but an answer clarifying what wikipedia article is trying to say, would definitively get an up vote.

Best Answer

The Wikipedia algorithm is correct, but a little terse. A simple example should clear it up if we stick to actual earth moving. Histogram $A$ is three piles of dirt (earth) in a row: $A_0$, $A_1$, and $A_2$, and histogram $B$ is three piles of dirt $B_0$, $B_1$, and $B_2$ in a row.

Suppose the number of shovelfuls in each dirt pile is as follows.

$A_0 = 3$, $A_1 = 2$, $A_2 = 1$

$B_0 = 1$, $B_1 = 2$, $B_2 = 3$

So to make $A$ look like $B$, we really need to take two shovelfuls of dirt from $A_0$ and put them in $A_2$; i.e., move two shovelfuls two units of distance: that's four units of work. The algorithm has us move the two shovelfuls from $A_0$ to $A_1$, and then two from $A_1$ to $A_2$, using $EMD_{i+1} = (A_i + EMD_i) - B_i$.

$$ EMD_0 = 0 \\ EMD_1 = (3 + 0) - 1 = 2 \\ EMD_2 = (2 + 2) - 2 = 2 \\ EMD_3 = (1 + 2) - 3 = 0 \\ \Rightarrow EMD = 2 + 2 + 0 = 4 $$

Related Question