Statistical distance induced by Fisher information metric on statistical manifold of categorical distribution (simplex)

I am trying to compute the information length or distance induced by the Fisher information metric on the statistical manifold of the categorical distribution (the interior of the n-dimensional simplex). I have checked each part of my computation several times. However, the result I obtain is dependent on my original choice of chart. How is this possible? Changing how the computation is done, I obtain a result consistent with another method, which I discuss at the end. However, inspection reveals that there are problems associated with that as well. How can I adapt the derivation to get the correct expression for the information distance?

Here, I summarise my computation of the information distance:

Suppose there are $n+1$ possible outcomes. Let $\mathring \Delta^n = \{x=(x_0, \ldots, x_n) \in \mathbb R^{n+1}\:|\: x_i > 0, \sum_i x_i =1\}$ be the statistical manifold of the categorical distribution in this case.

We choose the chart $\psi: \mathring \Delta^n \to \operatorname{Im}(\psi) =:U \subset \mathbb R^n : (x_0, \ldots, x_n) \mapsto (x_1, \ldots, x_n)$, with inverse $\psi^{-1}(y_1, \ldots, y_n)= (1-\sum_{i=1}^ny_i,y_1, \ldots, y_n)$. So now we can work in local (in fact global) coordinates on $U$.

The computation of the Fisher information metric is fairly
straightforward
(see https://www.ii.pwr.edu.pl/~tomczak/PDF/%5BJMT%5DFisher_inf.pdf for details). It
is given by:

$$g(y)=\sum_{\substack{i=1}}^n \frac 1{y_i} dy_i \otimes dy_i$$

Let $y^0, y^1 \in U$ be two points. We would like to find the distance
$d(y^0, y^1)$ induced by the Fisher information metric. This is the
length of the geodesic $\gamma :[0,1]\to \mathring \Delta^n$ between
the two. The length of a curve is given by:

$$L(\gamma)=\int_0^1 \sqrt{\dot \gamma(t)^Tg(\gamma(t)) \dot \gamma(t) }\:
dt = \int_0^1 \sqrt{\sum_{i=1}^n \frac {\dot
\gamma_i(t)^2}{\gamma_i(t)}} \: dt$$

We can obtain the geodesic via the geodesic equation $\ddot \gamma_k +
\sum_{ij}\Gamma^k_{ij} \dot \gamma_i \dot \gamma_j =0$, where
$\Gamma^k_{ij}$ are the Christoffel symbols of the Levi-Civita
connection. In our case the only non-zero Christoffel symbols are:

$$\Gamma^i_{ii}(y)=-\frac 1{2y_i}$$

The geodesic equation then becomes:

$$2\gamma_i\ddot \gamma_i - ( \dot \gamma_i)^2 =0, \quad \forall i
=1,\ldots,n$$

where $\gamma_i$ is the $i$-th component of the geodesic. It is clear from this equation that it admits a polynomial solution of
degree two. Solving with the boundary conditions $\gamma_i(0)=y^0_i,
\gamma_i(1)=y^1_i$ and constraint $0<\gamma_i(t)<1, \forall t$, we obtain the geodesic:

$$\gamma_i(t)=(\sqrt{y_i^0}-\sqrt
{y_i^1})^2t^2+2(\sqrt{y_i^0y_i^1}-y^0_i)t+y^0_i, t \in [0,1]$$

Recalling the definition of length, it is possible to show that $\frac {\dot
\gamma_i(t)^2}{\gamma_i(t)}\equiv \text{constant}, \forall i$. One way to do this is
to take the derivative of this expression and notice that it is zero.
With some rearrangement this implies

$$L(\gamma)= \sqrt{\sum_{i=1}^n \frac {\dot
\gamma_i(0)^2}{\gamma_i(0)}}= 2 ||\sqrt {y^1}- \sqrt {y^0}||=d(y^0,
y^1)$$

where $||\cdot||$ denotes the Euclidean norm and $\sqrt{\cdot}$ is
performed componentwise.

Summarising, for points $x^0, x^1 \in \mathring \Delta^n$, $$d(x^0, x^1) = 2|| \sqrt{\psi(x^0)}-\sqrt{ \psi(x^1) }||$$
where $||\cdot||$ denotes the Euclidean norm and $\sqrt{\cdot}$ is
performed componentwise.

The resulting formula is nice because it relates the information distance to the Euclidean distance. The problem is: it depends on the choice of chart.

If one chooses a different chart, e.g. $ \psi': \mathring \Delta^n \to Im(\psi') =:U'
\subset \mathbb R^n : (x_0, …, x_n) \mapsto (x_0, …, x_{n-1})$, one obtains different values for the distance.
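The chart dependence is easy to see numerically. The following is a minimal sketch, assuming NumPy; the helper name `chart_distance` and the two sample points are hypothetical choices made only for illustration. It evaluates the formula $2||\sqrt{\psi(x^0)}-\sqrt{\psi(x^1)}||$ once with the chart that drops $x_0$ and once with the chart that drops $x_n$.

```python
import numpy as np

def chart_distance(x0, x1, drop):
    """Evaluate 2 * ||sqrt(psi(x0)) - sqrt(psi(x1))||, where the chart psi
    discards the coordinate with index `drop` (hypothetical helper)."""
    y0 = np.delete(np.asarray(x0, dtype=float), drop)
    y1 = np.delete(np.asarray(x1, dtype=float), drop)
    return 2.0 * np.linalg.norm(np.sqrt(y0) - np.sqrt(y1))

# Two arbitrary points in the interior of the 2-simplex (n = 2).
x0 = np.array([0.7, 0.2, 0.1])
x1 = np.array([0.1, 0.3, 0.6])

d_psi = chart_distance(x0, x1, drop=0)                  # chart psi: drop x_0
d_psi_prime = chart_distance(x0, x1, drop=len(x0) - 1)  # chart psi': drop x_n

print("psi :", d_psi)
print("psi':", d_psi_prime)
```

The two printed values differ for these points, which is exactly the inconsistency described above: a genuine Riemannian distance cannot depend on the chart.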

Seeing this problem, I was tempted not to work in a chart at all since the statistical manifold is a subset of $\mathbb R^{n+1}$. That is to say, doing the exact same calculations using $x$ instead of $y$ and forgetting about charts. This gives the expression: $d(x^0, x^1) = 2|| \sqrt{x^0}-\sqrt{x^1 }||$.

This formula is much nicer than the first. It coincides (although rearranged) with the result obtained in p4 of http://www.pieter-kok.staff.shef.ac.uk/docs/geometrical_Cramer-Rao.pdf, using the Euclidean distance on the n-sphere. However, inspection shows that there are a number of problems with this.

To obtain it I used an $(n+1)$-dimensional Fisher information matrix, while the statistical manifold is $n$-dimensional.
The ensuing geodesic does not lie on the simplex, i.e. the sum of its $n+1$ components is in general $\neq 1$.

This second approach corresponds to extending the statistical manifold of interest to an open neighbourhood of $\mathring \Delta^n$, and indeed, the geodesics are geodesics in this space, as I could verify numerically that they were extrema of the energy functional of paths (https://en.wikipedia.org/wiki/Geodesic#Riemannian_geometry).
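As a quick illustration of the second problem, here is a sketch (assuming NumPy; `ambient_path` and the sample points are hypothetical) that evaluates the ambient curve $x(t) = ((1-t)\sqrt{x^0}+t\sqrt{x^1})^2$, which is the same quadratic-in-$t$ solution as before written with all $n+1$ components, and prints the sum of its components along the way.

```python
import numpy as np

def ambient_path(x0, x1, t):
    """Curve that is a straight line in square-root coordinates,
    mapped back by squaring componentwise (hypothetical helper)."""
    w = (1.0 - t) * np.sqrt(x0) + t * np.sqrt(x1)
    return w ** 2

x0 = np.array([0.7, 0.2, 0.1])
x1 = np.array([0.1, 0.3, 0.6])

for t in np.linspace(0.0, 1.0, 5):
    print(f"t = {t:.2f}, sum of components = {ambient_path(x0, x1, t).sum():.6f}")

mid_sum = ambient_path(x0, x1, 0.5).sum()
```

The sum equals 1 at the endpoints only; in between it drops below 1 whenever $x^0 \neq x^1$, so the curve leaves the simplex.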

Lastly, the case $n=1$ is sketched in http://www.boris-belousov.net/2017/07/11/distance-between-probabilities/#geodesic-distance-between-distributions but doesn't coincide with any of these approaches.

What did I miss? How can I adapt the derivation to get the right expression for the informational distance? Thank you for your help!

Best Answer

Your main problem is the formula $g(y)=\sum_{\substack{i=1}}^n \frac 1{y_i} dy_i \otimes dy_i$. This is not correct. Instead you should have $g(y)=\sum_{\substack{i=0}}^n \frac 1{y_i} dy_i \otimes dy_i$ where $y_0=1-\sum_1^n y_i$ and so $dy_0= -\sum_1^n d y_i$. This is the result in your first link.
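Spelled out in the chart coordinates, substituting $dy_0=-\sum_{i=1}^n dy_i$ gives the following expansion (a short worked step, using only the definitions above):

$$g(y)=\sum_{i=1}^n \frac 1{y_i}\, dy_i \otimes dy_i+\frac 1{y_0}\Big(\sum_{i=1}^n dy_i\Big)\otimes\Big(\sum_{j=1}^n dy_j\Big), \qquad g_{ij}(y)=\frac{\delta_{ij}}{y_i}+\frac{1}{1-\sum_{k=1}^n y_k}, \quad i,j=1,\ldots,n.$$

The extra term makes the metric non-diagonal in these coordinates, so the components of the geodesic equation no longer decouple.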

It is in fact best to think of this metric on $\Delta$ as the restriction of the metric $g(x)=\sum_{i=0}^n \frac 1{x_i} dx_i \otimes dx_i$ on the ambient $\mathbb{R}^{n+1}$, or better yet, on the positive orthant $O=\{x|x_i>0\}$ in $\mathbb{R}^{n+1}$. While you cannot compute the geodesics of the restriction directly, you can use the fact that the diffeomorphism $x_i=w_i^2$ of $O$ to itself sends $g$ to the flat Euclidean metric $g_e=4\sum_0^n dw_i\otimes d w_i$, for indeed $dx_i=2w_i dw_i$ and $\frac 1{x_i} dx_i \otimes dx_i=4\,dw_i\otimes d w_i$. In particular, the subset $\Delta=\{x\in O| \sum x_i=1\}$ with the Fisher metric (the restriction of $g(x)$) is taken to the positive part of the sphere $S_+=\{w\in O| \sum w_i^2=1\}$ with the restriction of $g_e$ (this is the result in formula 10 of your second link). Note that the distance is then the geodesic distance on the sphere, not in the ambient space $(O, g_e)$ (this seems to be a mistake in formula 11 in your second link). That is, it is 2 times the length of the spherical arc from $w^0$ to $w^1$. Since the sphere is of unit radius, this is the same as 2 times the angle between $w^0$ and $w^1$, i.e. $2 \arccos (w^0\cdot w^1)$. This is the same as the result for $n=1$ in your third link.
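The closed form can be checked numerically. The following is a sketch under stated assumptions: it uses NumPy, the function names and sample points are hypothetical, and the numerical check integrates the full $(n+1)$-term metric $\sum_{i=0}^n dx_i^2/x_i$ along the great-circle arc between $w^0=\sqrt{x^0}$ and $w^1=\sqrt{x^1}$, mapped back to the simplex by squaring. It assumes $x^0 \neq x^1$ and both points in the interior.

```python
import numpy as np

def fisher_rao_distance(x0, x1):
    """2 * arccos(sum_i sqrt(x0_i * x1_i)): twice the angle between sqrt(x0) and sqrt(x1)."""
    bc = np.sum(np.sqrt(x0 * x1))
    return 2.0 * np.arccos(np.clip(bc, -1.0, 1.0))

def chord_distance(x0, x1):
    """2 * ||sqrt(x0) - sqrt(x1)||: the straight-line (ambient) distance."""
    return 2.0 * np.linalg.norm(np.sqrt(x0) - np.sqrt(x1))

def numerical_arc_length(x0, x1, steps=2000):
    """Length of the great-circle arc, measured with sum_i dx_i^2 / x_i on the simplex."""
    w0, w1 = np.sqrt(x0), np.sqrt(x1)
    theta = np.arccos(np.clip(w0 @ w1, -1.0, 1.0))
    t = np.linspace(0.0, 1.0, steps + 1)[:, None]
    # Spherical linear interpolation keeps every point on the unit sphere.
    w = (np.sin((1.0 - t) * theta) * w0 + np.sin(t * theta) * w1) / np.sin(theta)
    x = w ** 2                      # back to the simplex
    dx = np.diff(x, axis=0)
    x_mid = 0.5 * (x[1:] + x[:-1])  # midpoint rule for the metric coefficients
    return np.sum(np.sqrt(np.sum(dx ** 2 / x_mid, axis=1)))

x0 = np.array([0.7, 0.2, 0.1])
x1 = np.array([0.1, 0.3, 0.6])

d_arc = fisher_rao_distance(x0, x1)
d_chord = chord_distance(x0, x1)
d_num = numerical_arc_length(x0, x1)
print("2*arccos formula :", d_arc)
print("numerical length :", d_num)
print("chord 2*||.||    :", d_chord)
```

The numerically integrated length should agree with $2\arccos(w^0\cdot w^1)$ up to discretisation error, while the chord value is strictly smaller, since a chord is shorter than the arc it subtends.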

By the way, the computations you performed with $y_i$'s and the metric $g(y)=\sum_{\substack{i=1}}^n \frac 1{y_i} dy_i \otimes dy_i$ can be interpreted as consistent with the metric being isometric to Euclidean metric via $y_i=z_i^2$, the same thing as in the $x$s and $w$s except in one dimension lower -- except for the fact that you seem to have missed a minus sign in your geodesic equation, which should read $-2\gamma_i\ddot \gamma_i + ( \dot \gamma_i)^2 =0$. With this modification it is in fact equivalent to $z_i''=0$ i.e. the geodesic equations for straight lines -- geodesics in the $z$-coordinate space. The corrected formula is also $z(t)=(1-t)z^0+tz^1$, equivalently, $y(t)= ((1-t)\sqrt{y^0}+t\sqrt{y^1})^2$, indeed quadratic in $t$ in the $y$-coordinates.
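For completeness, the substitution $y_i=z_i^2$ can be carried out explicitly (a short worked step, with no assumptions beyond that substitution):

$$\dot y_i = 2 z_i \dot z_i,\qquad \ddot y_i = 2\dot z_i^2+2 z_i\ddot z_i,\qquad -2y_i\ddot y_i+\dot y_i^2=-4z_i^2\dot z_i^2-4z_i^3\ddot z_i+4z_i^2\dot z_i^2=-4z_i^3\,\ddot z_i .$$

Since $z_i>0$, the left-hand side vanishes exactly when $\ddot z_i=0$.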

To summarize, the simplex with the information metric is isometric, via the coordinate-wise square root, to the positive part of the unit sphere with all lengths scaled by 2, and the distance is 2 times the angle between the images.
