Why doesn't SciPy's wasserstein_distance use linear programming?

linear-programming, optimal-transport, programming, python, statistics

Question

Kantorovich's formulation of the optimal transport problem using the Wasserstein distance is

$$W_p(a,b)=\Big(\min_{\gamma} \sum_{i,j}\gamma_{i,j}\,\|x_i-y_j\|^p\Big)^{\frac{1}{p}}\quad \text{s.t.}\quad \gamma \mathbf{1} = a,\;\; \gamma^{T}\mathbf{1} = b,\;\; \gamma\geq 0$$

It is often taught that this is solved using linear programming:

$$OT(a,b)=\min_{\gamma} \sum_{i,j}\gamma_{i,j}M_{i,j}\quad \text{s.t.}\quad \gamma \mathbf{1} = a,\;\; \gamma^{T}\mathbf{1} = b,\;\; \gamma\geq 0$$

How, then, is the function scipy.stats.wasserstein_distance able to compute the Wasserstein/OT distance without linear programming? What approach/method is the function using?
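
(For concreteness, this is roughly what the linear-programming route would look like with scipy.optimize.linprog; it is my own sketch of the textbook formulation above, not what scipy.stats does internally:)

# My own sketch of the LP route (not what scipy.stats does): solve Kantorovich's
# problem directly with scipy.optimize.linprog on a small example.
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0, 3.0])
y = np.array([5.0, 6.0, 8.0])
a = np.full(3, 1/3)                      # uniform weights on x
b = np.full(3, 1/3)                      # uniform weights on y

M = np.abs(x[:, None] - y[None, :])      # cost matrix M_ij = |x_i - y_j|
n, m = M.shape

# Marginal constraints: gamma 1 = a (row sums) and gamma^T 1 = b (column sums),
# with gamma flattened row-major into n*m variables.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([a, b])

res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print(res.fun)   # 5.0 -- the same value that wasserstein_distance([0, 1, 3], [5, 6, 8]) returns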

My guess

I am aware that wasserstein_distance handles 1D distributions only, and there happens to be a closed-form analytical solution in the 1D case, which is

$$\mathcal{W}_{p}(\mu, \nu) =\left(\int_{0}^{1}\left|F_{\mu}^{-1}(z)-F_{\nu}^{-1}(z)\right|^{p} \mathrm{d} z\right)^{\frac{1}{p}}$$

The formula above involves the inverse CDFs (quantile functions) $F^{-1}$. The scipy code, quoted in the Code section below, however, does not appear to compute inverse CDFs anywhere, as far as I can see.
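
To illustrate what an inverse-CDF implementation could look like (my own sketch, not scipy's code): for two samples of equal size with uniform weights, the quantile functions are step functions, so the integral above reduces to the mean absolute difference between the sorted samples:

import numpy as np
from scipy.stats import wasserstein_distance

u = np.array([0.0, 1.0, 3.0])
v = np.array([5.0, 6.0, 8.0])

# With n equally weighted points each, F^{-1} is a step function taking the
# sorted sample values, so the integral collapses to a mean of sorted differences.
w1 = np.mean(np.abs(np.sort(u) - np.sort(v)))

print(w1)                          # 5.0
print(wasserstein_distance(u, v))  # 5.0, the same value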

Code

I know this sounds like a Stack Overflow question, but it is not (and I need to write the math formulas): I am trying to figure out what formula/approach the code uses, because it does not resemble anything above. It does not use linear programming, otherwise we would see scipy.optimize.linprog somewhere, and it does not use inverse probability functions, otherwise we would see .ppf somewhere:

import numpy as np

# Note: _validate_distribution is a private scipy helper that converts the
# inputs to arrays and validates the weights.

def wasserstein_distance(u_values, v_values, u_weights=None, v_weights=None):
    r"""
    Compute the first Wasserstein distance between two 1D distributions.
    This distance is also known as the earth mover's distance, since it can be
    seen as the minimum amount of "work" required to transform :math:`u` into
    :math:`v`, where "work" is measured as the amount of distribution weight
    that must be moved, multiplied by the distance it has to be moved.
    .. versionadded:: 1.0.0
    Parameters
    ----------
    u_values, v_values : array_like
        Values observed in the (empirical) distribution.
    u_weights, v_weights : array_like, optional
        Weight for each value. If unspecified, each value is assigned the same
        weight.
        `u_weights` (resp. `v_weights`) must have the same length as
        `u_values` (resp. `v_values`). If the weight sum differs from 1, it
        must still be positive and finite so that the weights can be normalized
        to sum to 1.
    Returns
    -------
    distance : float
        The computed distance between the distributions.
    Notes
    -----
    The first Wasserstein distance between the distributions :math:`u` and
    :math:`v` is:
    .. math::
        l_1 (u, v) = \inf_{\pi \in \Gamma (u, v)} \int_{\mathbb{R} \times
        \mathbb{R}} |x-y| \mathrm{d} \pi (x, y)
    where :math:`\Gamma (u, v)` is the set of (probability) distributions on
    :math:`\mathbb{R} \times \mathbb{R}` whose marginals are :math:`u` and
    :math:`v` on the first and second factors respectively.
    If :math:`U` and :math:`V` are the respective CDFs of :math:`u` and
    :math:`v`, this distance also equals to:
    .. math::
        l_1(u, v) = \int_{-\infty}^{+\infty} |U-V|
    See [2]_ for a proof of the equivalence of both definitions.
    The input distributions can be empirical, therefore coming from samples
    whose values are effectively inputs of the function, or they can be seen as
    generalized functions, in which case they are weighted sums of Dirac delta
    functions located at the specified values.
    References
    ----------
    .. [1] "Wasserstein metric", https://en.wikipedia.org/wiki/Wasserstein_metric
    .. [2] Ramdas, Garcia, Cuturi "On Wasserstein Two Sample Testing and Related
           Families of Nonparametric Tests" (2015). :arXiv:`1509.02237`.
    Examples
    --------
    >>> from scipy.stats import wasserstein_distance
    >>> wasserstein_distance([0, 1, 3], [5, 6, 8])
    5.0
    >>> wasserstein_distance([0, 1], [0, 1], [3, 1], [2, 2])
    0.25
    >>> wasserstein_distance([3.4, 3.9, 7.5, 7.8], [4.5, 1.4],
    ...                      [1.4, 0.9, 3.1, 7.2], [3.2, 3.5])
    4.0781331438047861
    """
    return _cdf_distance(1, u_values, v_values, u_weights, v_weights)

def _cdf_distance(p, u_values, v_values, u_weights=None, v_weights=None):
    r"""
    Compute, between two one-dimensional distributions :math:`u` and
    :math:`v`, whose respective CDFs are :math:`U` and :math:`V`, the
    statistical distance that is defined as:
    .. math::
        l_p(u, v) = \left( \int_{-\infty}^{+\infty} |U-V|^p \right)^{1/p}
    p is a positive parameter; p = 1 gives the Wasserstein distance, p = 2
    gives the energy distance.
    Parameters
    ----------
    u_values, v_values : array_like
        Values observed in the (empirical) distribution.
    u_weights, v_weights : array_like, optional
        Weight for each value. If unspecified, each value is assigned the same
        weight.
        `u_weights` (resp. `v_weights`) must have the same length as
        `u_values` (resp. `v_values`). If the weight sum differs from 1, it
        must still be positive and finite so that the weights can be normalized
        to sum to 1.
    Returns
    -------
    distance : float
        The computed distance between the distributions.
    Notes
    -----
    The input distributions can be empirical, therefore coming from samples
    whose values are effectively inputs of the function, or they can be seen as
    generalized functions, in which case they are weighted sums of Dirac delta
    functions located at the specified values.
    References
    ----------
    .. [1] Bellemare, Danihelka, Dabney, Mohamed, Lakshminarayanan, Hoyer,
           Munos "The Cramer Distance as a Solution to Biased Wasserstein
           Gradients" (2017). :arXiv:`1705.10743`.
    """
    u_values, u_weights = _validate_distribution(u_values, u_weights)
    v_values, v_weights = _validate_distribution(v_values, v_weights)

    u_sorter = np.argsort(u_values)
    v_sorter = np.argsort(v_values)

    all_values = np.concatenate((u_values, v_values))
    all_values.sort(kind='mergesort')

    # Compute the differences between pairs of successive values of u and v.
    deltas = np.diff(all_values)

    # Get the respective positions of the values of u and v among the values of
    # both distributions.
    u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')
    v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')

    # Calculate the CDFs of u and v using their weights, if specified.
    if u_weights is None:
        u_cdf = u_cdf_indices / u_values.size
    else:
        u_sorted_cumweights = np.concatenate(([0],
                                              np.cumsum(u_weights[u_sorter])))
        u_cdf = u_sorted_cumweights[u_cdf_indices] / u_sorted_cumweights[-1]

    if v_weights is None:
        v_cdf = v_cdf_indices / v_values.size
    else:
        v_sorted_cumweights = np.concatenate(([0],
                                              np.cumsum(v_weights[v_sorter])))
        v_cdf = v_sorted_cumweights[v_cdf_indices] / v_sorted_cumweights[-1]

    # Compute the value of the integral based on the CDFs.
    # If p = 1 or p = 2, we avoid using np.power, which introduces an overhead
    # of about 15%.
    if p == 1:
        return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))
    if p == 2:
        return np.sqrt(np.sum(np.multiply(np.square(u_cdf - v_cdf), deltas)))
    return np.power(np.sum(np.multiply(np.power(np.abs(u_cdf - v_cdf), p),
                                       deltas)), 1/p)
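
To make the mechanics of that final sum concrete, here is a hand trace on the second docstring example (my own annotation, not part of scipy): u puts weights 3 and 1 on the points 0 and 1, v puts weights 2 and 2 on the same points, and the integral of |U - V| becomes a weighted sum over the gaps between consecutive pooled values:

import numpy as np

# Second docstring example: wasserstein_distance([0, 1], [0, 1], [3, 1], [2, 2]).
all_values = np.array([0.0, 0.0, 1.0, 1.0])    # pooled sorted values
deltas = np.diff(all_values)                   # [0., 1., 0.]
u_cdf = np.array([0.75, 0.75, 1.0])            # U just to the right of all_values[:-1]
v_cdf = np.array([0.50, 0.50, 1.0])            # V at the same evaluation points
print(np.sum(np.abs(u_cdf - v_cdf) * deltas))  # 0.25, as in the docstring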

Best Answer

A more complete explanation of the algorithm used is given in Remark 2.28 of the book Computational Optimal Transport by Peyré and Cuturi. A brief explanation of how the algorithm works is the following:

We are dealing with the 1-D case with two discrete distributions, where we want to transport $\alpha$ onto $\beta$. Sort both distributions; then, sweeping from left to right, move the mass of each point of $\alpha$ to the nearest point of $\beta$ that still has unfilled capacity, until that mass is completely transported. The figure below, taken from the book, illustrates this.

[Figure from Computational Optimal Transport: the 1-D sort-and-sweep transport between two discrete distributions]
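
A minimal sketch of that sort-and-sweep idea (my own illustrative code with a hypothetical helper name w1_greedy, not the book's or scipy's implementation), which reproduces the third docstring example of wasserstein_distance:

import numpy as np

def w1_greedy(x, a, y, b):
    """Sort both point sets and sweep left to right, always sending mass
    from the current source point to the current target point (W_1 only)."""
    x, a = np.asarray(x, float), np.asarray(a, float)
    y, b = np.asarray(y, float), np.asarray(b, float)
    ix, iy = np.argsort(x), np.argsort(y)
    x, a = x[ix], a[ix] / a.sum()        # sorted source points, normalized weights
    y, b = y[iy], b[iy] / b.sum()        # sorted target points, normalized weights
    i = j = 0
    cost = 0.0
    while i < len(x) and j < len(y):
        m = min(a[i], b[j])              # mass that can move in this step
        cost += m * abs(x[i] - y[j])     # pay the distance per unit of mass
        a[i] -= m
        b[j] -= m
        if a[i] == 0.0:                  # source point emptied: advance it
            i += 1
        if b[j] == 0.0:                  # target point filled: advance it
            j += 1
    return cost

print(w1_greedy([3.4, 3.9, 7.5, 7.8], [1.4, 0.9, 3.1, 7.2],
                [4.5, 1.4], [3.2, 3.5]))
# ~4.0781331438, matching the third docstring example of wasserstein_distance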
