Ordinal Data – Understanding Earth Mover (Wasserstein) Distance in Ordinal Discrete Data

distance, likert, ordinal-data, wasserstein

I am doing data analysis for my Master's research, which includes some Likert scale type questions. I have been calculating some distances between the responses to these questions. All this has gone well, with the exception of calculating the Earth Mover distance. I understand the intuition behind the Earth Mover distance ('moving dirt to holes') but am stumped on how to calculate it.

If I have the following ordinal data from two 5-point Likert scale type questions, $F = \{15, 10, 17, 13, 12\}$ and $M = \{11, 8, 12, 21, 13\}$, how would I go about calculating the Earth Mover Distance? (Preference for answers not using R: I'd like to understand how this distance is calculated first before using R.)

Best Answer

The Earth Mover Distance (EMD), also known as the Wasserstein distance or metric, compares two probability distributions (e.g. see https://doi.org/10.1137/1118101). The EMD assumes that the total quantity of dirt to move equals the space (i.e. the 'holes') available to receive it, so no residual dirt remains. Because your two questions have different numbers of responses, the simplest way to satisfy this is to convert the raw counts into proportions.

First, calculate the sum of each vector: $\sum{F}=67$ and $\sum{M}=65$. Then divide each component of a vector by that vector's sum. To three decimal places, this gives $f = \{0.224, 0.149, 0.254, 0.194, 0.179 \}$ and $m = \{0.169, 0.123, 0.185, 0.323, 0.200 \}$.
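If it helps to check the proportions outside a spreadsheet, the normalisation step can be sketched in a few lines of Python. The function name `to_proportions` is my own choice for illustration; the counts are the ones from your question.

```python
# Raw response counts for the two 5-point Likert questions (from the question).
F = [15, 10, 17, 13, 12]
M = [11, 8, 12, 21, 13]


def to_proportions(counts):
    """Divide each count by the vector's total so the result sums to 1."""
    total = sum(counts)
    return [c / total for c in counts]


f = to_proportions(F)
m = to_proportions(M)

print(sum(F), sum(M))            # 67 65
print([round(x, 3) for x in f])  # [0.224, 0.149, 0.254, 0.194, 0.179]
print([round(x, 3) for x in m])  # [0.169, 0.123, 0.185, 0.323, 0.2]
```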

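In general, the EMD is the solution of a transportation problem, for which methods such as the Hungarian algorithm are used. For ordinal categories that are one unit apart, however, it reduces to a running sum of the differences between the two distributions, which we can calculate as follows (Note: I have aligned the notation to your example and simplified some ambiguity). This is an iterative calculation over the categories $i = 0, \dots, n-1$, where we start off by setting $d_0 = 0$, then apply the following steps: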

$d_{i+1} = f_i + d_i - m_i$

$EMD=\sum_{i=0}^{n} |d_i| $
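The recurrence and the final sum can be written as a short loop. This is only a minimal sketch, assuming the categories are one unit apart and both inputs are proportions that sum to 1; the name `emd_1d` is mine, not a library function.

```python
def emd_1d(f, m):
    """EMD between two distributions over the same ordered categories.

    Implements d_{i+1} = f_i + d_i - m_i with d_0 = 0 and returns the
    sum of |d_i|. Assumes adjacent categories are one unit apart.
    """
    d = 0.0      # d_0
    total = 0.0  # running sum of |d_i|
    for f_i, m_i in zip(f, m):
        d = f_i + d - m_i  # dirt still to be carried past this category
        total += abs(d)
    return total


# Counts from the question, converted to proportions.
F = [15, 10, 17, 13, 12]
M = [11, 8, 12, 21, 13]
f = [c / sum(F) for c in F]
m = [c / sum(M) for c in M]

print(round(emd_1d(f, m), 4))  # 0.3063
```

Each $d_i$ is the amount of 'dirt' that has to be carried from category $i$ to the next one, which is why summing the absolute values gives the total work.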

As you requested not using R, below is your data using Excel. The first screen snip is the data and results; the second shows the formula used.

[Image: Excel data and results]

[Image: Excel formulas]

The EMD for your data is $\mathbf{0.3063}$.

Note that columns F and G in the spreadsheet look the same. For your data, $d_i$ is non-negative for all $i$, so taking absolute values changes nothing. For other data, however, $d_i$ can be negative, and the summation requires absolute values (column G).
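To see why the absolute values matter, here is a small sketch with made-up proportions (not your data) in which the running differences change sign. The helper name `running_differences` and the two demo vectors are hypothetical, chosen only for illustration.

```python
def running_differences(f, m):
    """Return d_1..d_n from d_{i+1} = f_i + d_i - m_i with d_0 = 0."""
    d, out = 0.0, []
    for f_i, m_i in zip(f, m):
        d = f_i + d - m_i
        out.append(d)
    return out


# Hypothetical 3-category distributions: f is split between the two ends,
# m sits entirely in the middle category.
f_demo = [0.5, 0.0, 0.5]
m_demo = [0.0, 1.0, 0.0]

d = running_differences(f_demo, m_demo)
signed_sum = sum(d)            # positive and negative terms cancel
emd = sum(abs(x) for x in d)   # absolute values count the work in both directions

print(d)           # [0.5, -0.5, 0.0]
print(signed_sum)  # 0.0
print(emd)         # 1.0
```

Without the absolute values the two moves (half the mass one step right, half one step left) would cancel and wrongly suggest a distance of zero.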

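The lower bound of the EMD is zero, reached when $f_i = m_i$ for all $i$. The upper bound is $n - 1$, where $n$ is the number of categories (here $n = 5$); it is reached when all responses in one distribution fall in the first category and all responses in the other fall in the last. Hence one could normalise the EMD to the interval $[0, 1]$ by: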

$EMD_n=\frac{EMD}{n-1} $

The normalised EMD for your data is $\mathbf{0.0766}$.
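The bounds and the normalisation can be checked with a short sketch. As before, the function names are my own, and the code assumes proportions over equally spaced categories; the 'extreme' vectors are hypothetical and only there to illustrate the upper bound.

```python
def emd_1d(f, m):
    """Sum of |d_i| with d_{i+1} = f_i + d_i - m_i and d_0 = 0."""
    d, total = 0.0, 0.0
    for f_i, m_i in zip(f, m):
        d = f_i + d - m_i
        total += abs(d)
    return total


def normalised_emd(f, m):
    """Scale the EMD by its upper bound n - 1, where n = number of categories."""
    return emd_1d(f, m) / (len(f) - 1)


# Hypothetical worst case on a 5-point scale: everything at opposite ends.
all_first = [1.0, 0.0, 0.0, 0.0, 0.0]
all_last = [0.0, 0.0, 0.0, 0.0, 1.0]
print(emd_1d(all_first, all_last))          # 4.0, i.e. n - 1
print(normalised_emd(all_first, all_last))  # 1.0

# The data from the question.
F = [15, 10, 17, 13, 12]
M = [11, 8, 12, 21, 13]
f = [c / sum(F) for c in F]
m = [c / sum(M) for c in M]
print(round(normalised_emd(f, m), 4))       # 0.0766
```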
