Why is only the Fisher information a Riemannian metric?

fisher-information, riemannian-geometry

I recently discovered that the Fisher information induces a Riemannian metric on statistical manifolds. For discrete probability distributions the Fisher information is given by
$$\mathcal I_{j,k}(\theta)=\left\langle \frac{\partial \ln p_i(\theta)}{\partial \theta_j}\frac{\partial \ln p_i(\theta)}{\partial \theta_k}\right\rangle=\sum_i p_i(\theta)\frac{\partial \ln p_i(\theta)}{\partial \theta_j}\frac{\partial \ln p_i(\theta)}{\partial \theta_k}$$
where $p_i(\theta)$ is a discrete probability distribution over $i$, parametrized by a vector $\theta$ of parameters.
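For instance, in the simplest case of a Bernoulli family $p_1(\theta)=\theta$, $p_2(\theta)=1-\theta$ with a single parameter $\theta\in(0,1)$, this reduces to
$$\mathcal I(\theta)=\theta\left(\frac{1}{\theta}\right)^2+(1-\theta)\left(\frac{-1}{1-\theta}\right)^2=\frac{1}{\theta(1-\theta)}.$$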

It induces a metric $h$ given by
$$h=\frac 1 4\sum_i\sum_{j,k}p_i(\theta)\frac{\partial \ln p_i(\theta)}{\partial \theta_j}\frac{\partial \ln p_i(\theta)}{\partial \theta_k}\mathrm d\theta_j\mathrm d\theta_k$$
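In the Bernoulli example above this becomes
$$h=\frac{\mathrm d\theta^2}{4\,\theta(1-\theta)},$$
and the substitution $\theta=\sin^2\varphi$ turns it into the flat metric $\mathrm d\varphi^2$ on the interval $0<\varphi<\pi/2$.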

My question is: what makes this a Riemannian metric? Why not simply the metric $h=\sum_i\mathrm d p_i\,\mathrm d p_i$? Part of the confusion may stem from the fact that I do not completely understand the formal definition of a Riemannian metric.


Edit: after reading more carefully, I seem to have misunderstood what I read. I based this question on the following quote from the article https://en.wikipedia.org/wiki/Fisher_information_metric:

The metric is interesting in several respects. By Chentsov's theorem, the Fisher information metric on statistical models is the only Riemannian metric (up to rescaling) that is invariant under sufficient statistics.[1][2]

So for me to fully appreciate this fact, I have to do some more reading on sufficient statistics.

Best Answer

The Fisher information metric is, as you say, a Riemannian metric defined on a finite-dimensional family of discrete probability distributions. If it is a $k$-dimensional family of distributions, it should be viewed as a $k$-dimensional smooth manifold $M$, where each point $P$ in the manifold is a discrete probability distribution, i.e., $$ P = (p_1, \dots, p_N), $$ where \begin{align*} p_1, \dots, p_N &> 0\\ p_1+\cdots + p_N &= 1. \end{align*}
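For example, with $N = 3$ the set of all such $P$ is the open $2$-simplex $\{(p_1,p_2,p_3)\,:\,p_i>0,\ p_1+p_2+p_3=1\}$, a triangle sitting inside $\mathbb{R}^3$; a $1$-parameter family of distributions then traces out a curve inside this triangle, and that curve is a $1$-dimensional manifold $M$.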

A $k$-dimensional parameterization of all or part of the family is equivalent to a coordinate chart on the manifold $M$. More precisely, denote the parameters (coordinates) by $\theta = (\theta^1, \dots, \theta^k) \in U$, where $U$ is an open subset of $\mathbb{R}^k$; a parameterization (coordinate map) is then a smooth injective immersion $\Phi: U \rightarrow M$, which can be written as $$ \Phi(\theta) = (p_1(\theta), \dots, p_N(\theta)). $$ The partial derivatives $$ \partial_j\Phi(\theta) = \left(\frac{\partial p_1}{\partial\theta^j}, \dots, \frac{\partial p_N}{\partial\theta^j}\right), \qquad j = 1, \dots, k, $$ form a basis of the tangent space at each point $\Phi(\theta)$.
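To make this concrete, take two outcomes ($N=2$) and one parameter ($k=1$) with $U=(0,1)$: $$ \Phi(\theta) = (\theta,\ 1-\theta), \qquad \partial_\theta\Phi = (1,\ -1), $$ so the tangent space at each point is the line spanned by $(1,-1)$, which is tangent to the segment $p_1+p_2=1$.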

A Riemannian metric is an inner product $\langle\cdot,\cdot\rangle$ defined on each tangent space $T_PM$ that depends smoothly on the point, i.e., on the parameter (coordinates) $\theta$. Given the coordinates, a Riemannian metric is uniquely determined by the positive definite symmetric matrix whose components are $$ g_{jk} = \left\langle \frac{\partial\Phi}{\partial \theta^j},\frac{\partial \Phi}{\partial \theta^k}\right\rangle. $$ There are infinitely many possible choices of inner product. Two possibilities are $$ \left\langle \frac{\partial \Phi}{\partial \theta^j},\frac{\partial \Phi}{\partial \theta^k}\right\rangle = \sum_{i=1}^N \frac{\partial p_i}{\partial \theta^j}\frac{\partial p_i}{\partial \theta^k} $$ and the one that defines the Fisher information metric, $$ \left\langle \frac{\partial \Phi}{\partial \theta^j},\frac{\partial \Phi}{\partial \theta^k}\right\rangle = \sum_{i=1}^N \frac{1}{p_i}\frac{\partial p_i}{\partial \theta^j}\frac{\partial p_i}{\partial \theta^k} = \sum_{i=1}^N p_i\frac{\partial \ln p_i}{\partial \theta^j}\frac{\partial \ln p_i}{\partial \theta^k}. $$ What I do not recall is why the Fisher information metric is more useful or natural than other possible choices.
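To see how the two choices differ, take again $\Phi(\theta)=(\theta,\ 1-\theta)$. The first (Euclidean) inner product gives the constant metric $$ g_{11} = 1 + 1 = 2, $$ while the Fisher inner product gives $$ g_{11} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}, $$ which blows up as $\theta$ approaches $0$ or $1$; the two metrics assign genuinely different geometries to the same family.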
