Computing a “Generalized” Sinkhorn distance between two discrete probability distributions: A bi-convex optimization model

Tags: constraints, convex optimization, information theory, optimization, probability

Given two discrete probability distributions $\mu = (\mu_1, \mu_2, \ldots, \mu_m)$ on $X = \{x_1, x_2, \ldots, x_m\}$ and $\nu = (\nu_1, \nu_2, \ldots, \nu_n)$ on $Y = \{y_1, y_2, \ldots, y_n\}$, the Sinkhorn distance is

\begin{align}
d_{\alpha}(\mu, \nu, X, Y) = \min_{P \in \mathbb{R}_+^{m \times n}} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_i - y_j)^2 p_{ij},
\end{align}

where the minimization is subject to the constraints:

\begin{cases}
\sum_{j=1}^{n} p_{ij} &= \mu_i \quad \text{for all } i = 1, \ldots, m, \\
\sum_{i=1}^{m} p_{ij} &= \nu_j \quad \text{for all } j = 1, \ldots, n, \\
p_{ij} &\geq 0 \quad \text{for all } i = 1, \ldots, m \text{ and } j = 1, \ldots, n, \\
h(\mu) + h(\nu) - h(P) &\leq \alpha,
\end{cases}

where $h(\cdot)$ stands for entropy and $h(\mu) + h(\nu) - h(P)$ is also known as the mutual information of the coupling $P$. More concretely,

\begin{equation}
h(\mu) + h(\nu) - h(P) = -\sum_{i=1}^m \mu_i\log(\mu_i) - \sum_{j=1}^n \nu_j\log(\nu_j) + \sum_{i=1}^m \sum_{j=1}^n p_{ij}\log(p_{ij}).
\end{equation}
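As a quick sanity check of this identity (a sketch with illustrative names, not part of the original post), the snippet below evaluates $h(\mu) + h(\nu) - h(P)$ for the independent coupling $P = \mu\nu^\top$, whose mutual information should be zero:

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p log p, skipping zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Independent coupling P = mu nu^T: marginals are mu and nu,
# and the mutual information h(mu) + h(nu) - h(P) is exactly 0.
mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.4, 0.6])
P = np.outer(mu, nu)

I = entropy(mu) + entropy(nu) - entropy(P.ravel())
print(I)  # ~0 up to floating-point error
```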

As proposed in Sinkhorn Distances: Lightspeed Computation of Optimal Transport, this problem can be recast, via a Lagrange multiplier on the mutual-information constraint, as an entropy-regularized transport problem and solved with Sinkhorn's matrix scaling algorithm.
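For reference, here is a minimal sketch of Sinkhorn's scaling iterations for the fixed-$Y$ subproblem, assuming the entropy-regularized formulation in which `lam` plays the role of the Lagrange multiplier for the mutual-information constraint (the names and the toy data are mine):

```python
import numpy as np

def sinkhorn(mu, nu, C, lam=10.0, n_iter=200):
    """Sinkhorn matrix scaling for the entropy-regularized problem
    min <C, P> - (1/lam) h(P)  s.t.  P 1 = mu, P^T 1 = nu.
    Returns the coupling P = diag(u) K diag(v) with K = exp(-lam C)."""
    K = np.exp(-lam * C)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)  # match column marginals
        u = mu / (K @ v)    # match row marginals
    return u[:, None] * K * v[None, :]

# Toy example: supports on the line, squared-distance cost matrix.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5])
C = (x[:, None] - y[None, :]) ** 2
mu = np.full(3, 1 / 3)
nu = np.full(2, 1 / 2)
P = sinkhorn(mu, nu, C)
print(P.sum(axis=1))  # ~ mu
print(P.sum(axis=0))  # ~ nu
```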
Now, I wonder if there is an algorithm that can compute the following "generalized" Sinkhorn distance, where the $y_j$ are optimization variables instead of constants, subject to the same constraints as above?

\begin{align}
d_{\alpha}(\mu, \nu, X) = \min_{P \in \mathbb{R}_+^{m \times n}, Y \in \mathbb{R}^n} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_i - y_j)^2 p_{ij},
\end{align}

Maybe I can optimize $P$ and $Y$ alternately (each subproblem is convex), or use ADMM (how exactly?), but perhaps there is an algorithm that converges faster?

Additional Information:

  1. This problem stems from the computation of the alphabet-constrained distortion-rate function, hence the information-theory tag. In particular, $\mu$ and $X$ (resp. $\nu$ and $Y$) stand for the distribution and support of the source (resp. reproduction) random variable, $\alpha$ stands for the rate, and the objective is the MSE distortion. We allow $y_i$ to be variables because the entries of the reproduction alphabet can be anything.
  2. $m \gg n$. This is because the reproduction alphabet is implemented as a lookup table and we don't want it to take too much space.
  3. Both $\mu$ and $\nu$ are close to uniform, because it's easier to generate uniform pseudorandom numbers. In fact, you may assume $\mu$ and $\nu$ are discrete uniform distributions on $X$ and $Y$ respectively.

Best Answer

Here are some observations that may help you study your problem.

The problem in the OP is the minimization of a bi-convex function over a convex set. Bi-convex optimization problems form a nice class of global optimization problems that has been well studied for several decades; see the 2007 review paper [2] for an introduction.

An efficient heuristic for solving bi-convex optimization models is Alternate Convex Search (ACS), a special case of block-relaxation methods, which converges (though not necessarily to a global optimum) under some conditions; see Section 4.2.1 in [2]. ACS solves a sequence of convex optimization problems by alternately fixing one block of variables. The subproblem in $Y$ for fixed $P$ is an unconstrained convex quadratic, so $Y^*$ can be found in closed form.
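Concretely, setting the gradient in $y_j$ to zero gives the $P$-weighted barycentre $y_j^* = \sum_i p_{ij} x_i / \sum_i p_{ij}$. A minimal sketch of this $Y$-step (the function name and toy coupling are mine; the $P$-step would be delegated to an entropy-constrained transport solver):

```python
import numpy as np

def update_Y(P, x):
    """Closed-form minimizer of sum_ij (x_i - y_j)^2 p_ij over y:
    each y_j is the P-weighted mean of the x_i (column barycentre).
    Assumes every column of P carries positive mass."""
    col_mass = P.sum(axis=0)
    return (P.T @ x) / col_mass

# Toy coupling: first two source points map to y_1, last two to y_2.
x = np.array([0.0, 1.0, 2.0, 3.0])
P = np.array([[0.25, 0.0],
              [0.25, 0.0],
              [0.0, 0.25],
              [0.0, 0.25]])
y = update_Y(P, x)
print(y)  # [0.5, 2.5]
```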

As your problem is bounded from below, by Theorem 4.5 in [2] the sequence of objective values generated by ACS converges monotonically. Moreover, since $\min_{i\in[m]}x_i \le y_j^* \le \max_{i\in[m]} x_i$ for every $j$, by Theorem 4.7 in [2] the ACS always returns a partial optimum (which may not be a global optimum).

Theoretically, any global optimum can be generated by ACS from some suitable starting point. Therefore, a multistart version of ACS can be used to search for the global optimum of a biconvex minimization problem [3].

You can generate random initial supports $y_0$ uniformly from the box $\left[ \min_{i\in[m]}x_i, \max_{i\in[m]} x_i \right]^n$, and run ACS from as many of them as your budget allows, keeping the best solution found.
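The multistart loop might look as follows. This is only a sketch: `lloyd_acs` is a hypothetical stand-in for the real ACS inner loop that drops the $\nu$-marginal and rate constraints, so its $P$-step degenerates to nearest-centre assignment and the $Y$-step to cell barycentres (i.e. 1-D $k$-means / Lloyd iterations); a real implementation would substitute the entropy-constrained transport solver for the $P$-step:

```python
import numpy as np

rng = np.random.default_rng(0)

def lloyd_acs(x, y0, n_iter=50):
    """Toy ACS stand-in ignoring the nu-marginal and rate constraints:
    alternate nearest-centre assignment (P-step) and barycentres (Y-step)."""
    y = y0.copy()
    for _ in range(n_iter):
        j = np.argmin((x[:, None] - y[None, :]) ** 2, axis=1)
        for k in range(len(y)):
            if np.any(j == k):        # leave empty cells untouched
                y[k] = x[j == k].mean()
    j = np.argmin((x[:, None] - y[None, :]) ** 2, axis=1)
    obj = np.mean((x - y[j]) ** 2)    # MSE distortion under uniform mu
    return obj, y

def multistart(x, n, n_starts=20):
    """Draw starts uniformly from the box [min x, max x]^n and keep
    the best partial optimum found by the inner ACS routine."""
    best_obj, best_y = np.inf, None
    for _ in range(n_starts):
        y0 = rng.uniform(x.min(), x.max(), size=n)
        obj, y = lloyd_acs(x, y0)
        if obj < best_obj:
            best_obj, best_y = obj, y
    return best_obj, best_y

x = np.linspace(0.0, 1.0, 100)
obj, y = multistart(x, n=4)
print(obj)
```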

Since the objective in your problem is bi-convex with special structure, and the feasible set is convex and is the Cartesian product of the feasible sets of $Y$ and $P$, a numerical study may suggest properties of the global optimum; for example, you could numerically test the conjecture that every partial optimum of this model is a global one.

The Global Optimization Algorithm (GOA) is an exact method for solving bi-convex optimization, but it is unfortunately inefficient at large scales; see Section 4.2.2 in [2].


Moreover, your model can be written as (compare with (13)):

$$\min \, 0 + \sum_{i=1}^{m} \sum_{j=1}^{n}z_{ij} p_{ij} +0$$

subject to

$$ (x_i - y_j)^2 \le z_{ij}, \quad P \in [0,1]^{m \times n}, \quad Z \in \left[0, \left(\max_{i\in[m]}x_i - \min_{i\in[m]} x_i \right)^2 \right]^{m \times n}, \quad Y \in \left[\min_{i\in[m]} x_i, \max_{i\in[m]} x_i \right]^n$$

and the other constraints given in the OP. You can then apply the algorithms discussed in Section 4.2.3, which converge to a global optimum (but may be inefficient at large scales).
