Solved – What are some distributions over the probability simplex

compositional-data, distributions, multinomial-distribution

Let $\Delta_{K}$ be the probability simplex of dimension $K-1$, i.e. $x \in \Delta_{K}$ is such that $x_i \ge 0$ and $\sum_i x_i = 1$.

Which distributions over $\Delta_{K}$ are frequently used (or well known, or have been defined in the past)?

Clearly, there are the Dirichlet and the Logit-Normal distributions. Are there any other distributions which come up naturally in this context?

Best Answer

This is studied in compositional data analysis; there is a book by Aitchison, The Statistical Analysis of Compositional Data.

Define the simplex by $$ S^n =\{(x_1, \dots,x_{n+1}) \in {\mathbb R}^{n+1} \colon x_1>0,\dots, x_{n+1}>0, \sum_{i=1}^{n+1} x_i=1\}. $$ Note that we use the index $n$ to indicate dimension! Denote the geometric mean of an element $x$ of the simplex by $\tilde{x}$. Then we can define the logratio transformation (introduced by Aitchison) as $x=(x_1, \dots, x_{n+1}) \mapsto (\log(x_1/\tilde{x}), \dots, \log(x_n/\tilde{x}))$. This transformation is one-to-one and onto ${\mathbb R}^n$, so it has an inverse, which I leave to you to calculate. (There are also other versions of this transformation that can be used, which may have better mathematical properties; more about that later.)

Now you can take a normal (or whatever) distribution defined on ${\mathbb R}^n$ and use this inverse transformation to define a distribution on the simplex. The possibilities are limitless: each and every multivariate distribution on ${\mathbb R}^n$ gives a distribution on the simplex.
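The construction above can be sketched in a few lines of Python. This is only an illustration: the function names (`logratio`, `inverse_logratio`, `sample_simplex_normal`) are made up for this example, and the choice of independent normals on ${\mathbb R}^n$ is an assumption; any multivariate distribution could be used in its place.

```python
import math
import random


def logratio(x):
    """Map a point of the open simplex (n+1 positive parts summing to 1)
    to R^n: log of each of the first n parts over the geometric mean."""
    log_gm = sum(math.log(xi) for xi in x) / len(x)  # log of the geometric mean
    return [math.log(xi) - log_gm for xi in x[:-1]]


def inverse_logratio(y):
    """Map y in R^n back to the simplex with n+1 parts."""
    # All n+1 centred log-ratios sum to zero, which recovers the dropped one.
    full = list(y) + [-sum(y)]
    top = max(full)  # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in full]
    total = sum(w)
    return [wi / total for wi in w]


def sample_simplex_normal(mean, sd, rng=random):
    """Draw one point on the simplex by pushing independent normals
    (an illustrative choice) through the inverse transformation."""
    y = [rng.gauss(m, s) for m, s in zip(mean, sd)]
    return inverse_logratio(y)
```

For instance, `inverse_logratio(logratio([0.2, 0.3, 0.5]))` returns the original point up to rounding, and `sample_simplex_normal([0.0, 1.0], [1.0, 0.5])` returns three positive numbers that sum to 1.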

I will augment this post later with some examples, and more details on log-ratio transforms.

Notation

I'm going to rescale your simplex by a factor $n$, so that the lattice points have integer coordinates. This doesn't change anything, I just think it makes the notation a little less cumbersome.

Let $S$ be the $(n-1)$-simplex, given as the convex hull of the points $(n,0,\ldots,0)$, ..., $(0,\ldots,0,n)$ in $\mathbb R^{n}$. In other words, these are the points where all coordinates are non-negative, and where the coordinates sum to $n$.

Let $\Lambda$ denote the set of lattice points, i.e. those points in $S$ where all coordinates are integral.

If $P$ is a lattice point, we let $V_P$ denote its Voronoi cell, defined as those points in $S$ which are (strictly) closer to $P$ than to any other point in $\Lambda$.

There are two probability distributions we can put on $\Lambda$. One is the multinomial distribution, where the point $(a_1, \ldots, a_n)$ has the probability $n^{-n}\, n!/(a_1! \cdots a_n!)$ (that is, $n$ independent draws into $n$ equally likely categories). The other we will call the Dirichlet model, and it assigns to each $P \in \Lambda$ a probability proportional to the volume of $V_P$.

Very informal justification

I'm claiming that the multinomial model and the Dirichlet model give different distributions on $\Lambda$, whenever $n \geq 4$.

To see this, consider the case $n=4$, and the points $A = (2,2,0,0)$ and $B=(3,1,0,0)$. I claim that $V_A$ and $V_B$ are congruent via a translation by the vector $(1,-1,0,0)$. This means that $V_A$ and $V_B$ have the same volume, and thus that $A$ and $B$ have the same probability in the Dirichlet model. On the other hand, in the multinomial model, they have different probabilities ($4^{-4} \cdot 4!/(2!\,2!)$ and $4^{-4} \cdot 4!/(3!\,1!)$), and it follows that the distributions cannot be equal.
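The comparison can also be checked numerically. The sketch below estimates each Voronoi cell's share of the simplex by Monte Carlo and computes the multinomial probabilities. It makes two assumptions: the multinomial model is read as $n$ fair draws into $n$ categories (normalising constant $n^{-n}$), and the helper names (`lattice_points`, `multinomial_prob`, `voronoi_shares`) are invented for this example.

```python
import math
import random
from itertools import product


def lattice_points(n):
    """All points with n non-negative integer coordinates summing to n."""
    return [p for p in product(range(n + 1), repeat=n) if sum(p) == n]


def multinomial_prob(p):
    """Probability of p under n fair draws into n categories (assumed model)."""
    n = len(p)
    coef = math.factorial(n)
    for a in p:
        coef //= math.factorial(a)  # exact: partial multinomial coefficients are integers
    return coef / n ** n


def uniform_on_simplex(n, rng):
    """Uniform draw from the simplex scaled so that coordinates sum to n."""
    e = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(e)
    return [n * ei / s for ei in e]


def voronoi_shares(n, draws, rng):
    """Monte Carlo estimate of each Voronoi cell's share of the simplex volume."""
    pts = lattice_points(n)
    counts = dict.fromkeys(pts, 0)
    for _ in range(draws):
        x = uniform_on_simplex(n, rng)
        nearest = min(pts, key=lambda p: sum((xi - pi) ** 2 for xi, pi in zip(x, p)))
        counts[nearest] += 1
    return {p: c / draws for p, c in counts.items()}
```

Calling `voronoi_shares(4, 30000, random.Random(1))` and comparing the entries for `(2, 2, 0, 0)` and `(3, 1, 0, 0)` should give nearly equal shares (up to Monte Carlo noise), whereas `multinomial_prob` gives 6/256 and 4/256 for the same two points.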

The fact that $V_A$ and $V_B$ are congruent follows from the following plausible but non-obvious (and somewhat vague) claim:

Plausible Claim: The shape and size of $V_P$ is only affected by the "immediate neighbors" of $P$, (i.e. those points in $\Lambda$ which differ from $P$ by a vector that looks like $(1,-1,0,\ldots,0)$, where the $1$ and $-1$ may be in other places)

It's easy to see that the configurations of "immediate neighbors" of $A$ and $B$ are the same, and it then follows that $V_A$ and $V_B$ are congruent.

In case $n \geq 5$, we can play the same game, with $A = (2,2,n-4,0,\ldots,0)$ and $B=(3,1,n-4,0,\ldots,0)$, for example.

I don't think this claim is completely obvious, and I'm not going to prove it directly; instead, I will use a slightly different strategy below. However, I think this is a more intuitive answer to why the distributions are different for $n \geq 4$.

Rigorous proof

Take $A$ and $B$ as in the informal justification above. We only need to prove that $V_A$ and $V_B$ are congruent.

Given $P = (p_1, \ldots, p_n) \in \Lambda$, we define $W_P$ as the set of points $(x_1, \ldots, x_n) \in S$ for which $\max_{1 \leq i \leq n} (x_i - p_i) - \min_{1 \leq i \leq n} (x_i - p_i) < 1$. (In a more digestible manner: let $v_i = x_i - p_i$. Then $W_P$ is the set of points for which the difference between the highest and lowest $v_i$ is less than 1.)

We will show that $V_P = W_P$.

Step 1

Claim: $V_P \subseteq W_P$.

This is fairly easy. Suppose that $X = (x_1, \ldots, x_n) \in S$ is not in $W_P$. Let $v_i = x_i - p_i$, and assume (without loss of generality) that $v_1 = \max_{1\leq i\leq n} v_i$ and $v_2 = \min_{1\leq i\leq n} v_i$, so that $v_1 - v_2 \geq 1$. Since $\sum_{i=1}^n v_i = 0$, we also know that $v_1 > 0 > v_2$.

Let now $Q = (p_1 + 1, p_2 - 1, p_3, \ldots, p_n)$. Since $x_2 \geq 0$ and $v_2 < 0$, we have $p_2 > x_2 \geq 0$, and $p_2$ is an integer, so $p_2 \geq 1$. Hence $Q$ has non-negative integer coordinates summing to $n$, so $Q \in S$ and $Q \in \Lambda$. On the other hand, $\mathrm{dist}^2(X, P) - \mathrm{dist}^2(X, Q) = v_1^2 + v_2^2 - (1-v_1)^2 - (1+v_2)^2 = -2 + 2(v_1 - v_2) \geq 0$. Thus, $X$ is at least as close to $Q$ as to $P$, so $X \not\in V_P$. This shows (by taking complements) that $V_P \subseteq W_P$.
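For completeness, here is the distance computation written out. Only the first two coordinates of $P$ and $Q$ differ, so all other terms cancel, and $X - Q = (v_1 - 1, v_2 + 1, v_3, \ldots, v_n)$:

$$ \begin{aligned} \mathrm{dist}^2(X, P) - \mathrm{dist}^2(X, Q) &= \left[v_1^2 + v_2^2\right] - \left[(v_1 - 1)^2 + (v_2 + 1)^2\right] \\ &= (2v_1 - 1) + (-2v_2 - 1) \\ &= -2 + 2(v_1 - v_2). \end{aligned} $$

This is non-negative exactly when $v_1 - v_2 \geq 1$, which is the condition for $X$ to lie outside $W_P$.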

Step 2

Claim: The $W_P$ are pairwise disjoint.

Suppose otherwise. Let $P=(p_1,\ldots, p_n)$ and $Q = (q_1,\ldots,q_n)$ be distinct points in $\Lambda$, and let $X \in W_P \cap W_Q$. Since $P$ and $Q$ are distinct, have integer coordinates, and have the same coordinate sum, there must be one index $i$ where $p_i \leq q_i - 1$, and one where $p_i \geq q_i + 1$. Without loss of generality, we assume that $p_1 \leq q_1 - 1$, and $p_2 \geq q_2 + 1$. Rearranging and adding together, we get $q_1 - p_1 + p_2 - q_2 \geq 2$.

Consider now the numbers $x_1$ and $x_2$. From the fact that $X \in W_P$, we have $x_1 - p_1 - (x_2 - p_2) < 1$. Similarly, $X \in W_Q$ implies that $x_2 - q_2 - (x_1 - q_1) < 1$. Adding these together, we get $q_1 - p_1 + p_2 - q_2 < 2$, and we have a contradiction.

Step 3

We have shown that $V_P \subseteq W_P$, and that the $W_P$ are disjoint. The $V_P$ cover $S$ up to a set of measure zero, and it follows that $W_P = V_P$ (up to a set of measure zero). [Since $W_P$ and $V_P$ are both open, we actually have $W_P = V_P$ exactly, but this is not essential.]
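The identity $V_P = W_P$ can be spot-checked numerically. The sketch below draws random points of $S$ and verifies that each one lies in exactly one $W_P$, namely that of its nearest lattice point. The helper names (`in_W`, `nearest_lattice_point`, `check_cells`) are invented for this illustration, and a passing check is of course evidence, not a proof.

```python
import random
from itertools import product


def lattice_points(n):
    """All points with n non-negative integer coordinates summing to n."""
    return [p for p in product(range(n + 1), repeat=n) if sum(p) == n]


def in_W(x, p):
    """Membership in W_P: spread of the offsets v_i = x_i - p_i is below 1."""
    v = [xi - pi for xi, pi in zip(x, p)]
    return max(v) - min(v) < 1


def nearest_lattice_point(x, pts):
    """Lattice point closest to x in Euclidean distance."""
    return min(pts, key=lambda p: sum((xi - pi) ** 2 for xi, pi in zip(x, p)))


def check_cells(n, draws, rng):
    """True if every sampled point lies in exactly one W_P, and that P
    is its nearest lattice point (i.e. the sample sits in V_P)."""
    pts = lattice_points(n)
    for _ in range(draws):
        # Uniform draw from the simplex scaled so that coordinates sum to n.
        e = [rng.expovariate(1.0) for _ in range(n)]
        s = sum(e)
        x = [n * ei / s for ei in e]
        owners = [p for p in pts if in_W(x, p)]
        if owners != [nearest_lattice_point(x, pts)]:
            return False
    return True
```

A call such as `check_cells(4, 2000, random.Random(0))` exercises the case $n = 4$; points landing exactly on a cell boundary have probability zero and are ignored here.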

Now we are almost done. Consider the points $A = (2,2,n-4,0,\ldots,0)$ and $B = (3,1,n-4,0,\ldots,0)$. It is easy to see that $W_A$ and $W_B$ are translates of each other, and hence congruent: the only way they could differ is if the boundary of $S$ (other than the faces on which $A$ and $B$ both lie) were to "cut off" either $W_A$ or $W_B$ but not the other. But to reach such a part of the boundary of $S$, we would need to change one coordinate of $A$ or $B$ by at least 1, which is already enough to take us out of $W_A$ and $W_B$. Thus, even though $S$ does look different from the vantage points $A$ and $B$, the differences are too far away to be picked up by the definitions of $W_A$ and $W_B$, and thus $W_A$ and $W_B$ are congruent.

It follows then that $V_A$ and $V_B$ have the same volume, and thus the Dirichlet model assigns them the same probability, even though they have different probabilities in the multinomial model.

Solved – What would be the alternative to the Dirichlet distribution but parametrized by mean and variance

Having received no other suggestions, I am answering my own question. I ended up using the solution proposed in http://andrewgelman.com/2009/04/29/conjugate_prior/ (and in the related papers), although they considered it in the context of prior modelling.

So to model the positive random values that must sum to one, I consider the probability distribution $$ x_i = e^{z_i}/\sum_{j=1}^s e^{z_j},\quad i = 1,...,s $$ where $z_j \sim N(m_j,\sigma_j^2)$ independently. Now the fractions are clearly positive and sum to 1. The parameters $m_j,\sigma_j^2$ can be fitted as described in the paper in the blog post; however, I just fixed the first parameter to zero due to the unidentifiability problem and used a simulated method of moments approach to fit the rest (I generate many realizations of $z_j$ beforehand and construct a least-squares type fitness function for the given and simulated moments, which I minimize using an optimization algorithm and finite differences). This procedure seems to work fine in our case, where $s$ is either 4 or 5 and the input information (the given means and variances) is somewhat uncertain anyway. In some rare cases some of the fractions can end up being highly correlated, which may be against what one wishes. Also, if one of the fractions is known for certain to be zero, it should be eliminated from the fitting procedure to avoid some issues.
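A minimal sketch of this kind of simulated method of moments fit is given below. It is not the code I actually used; several details are assumptions made for the illustration: "the first parameter" is read as $m_1 = 0$, the standard deviations are parametrized through their logarithms, the loss is an unweighted sum of squares, and the optimizer is a plain finite-difference gradient descent with step halving. All function names are hypothetical.

```python
import math
import random


def simulate_moments(theta, eps):
    """Simulated means and variances of the fractions x_1..x_s.

    theta = [m_2..m_s, log(sigma_1)..log(sigma_s)], with m_1 fixed at 0
    (an assumed reading of "fixed the first parameter to zero").
    eps holds pre-generated standard normal draws, reused on every call.
    """
    s = len(eps[0])
    m = [0.0] + list(theta[:s - 1])
    sd = [math.exp(t) for t in theta[s - 1:]]
    sums = [0.0] * s
    sq_sums = [0.0] * s
    for row in eps:
        z = [m[j] + sd[j] * row[j] for j in range(s)]
        top = max(z)  # subtract the maximum for numerical stability
        w = [math.exp(v - top) for v in z]
        total = sum(w)
        for j in range(s):
            x = w[j] / total
            sums[j] += x
            sq_sums[j] += x * x
    k = len(eps)
    means = [a / k for a in sums]
    variances = [b / k - mu * mu for b, mu in zip(sq_sums, means)]
    return means, variances


def loss(theta, eps, target_means, target_vars):
    """Unweighted least-squares distance between simulated and given moments."""
    means, variances = simulate_moments(theta, eps)
    return (sum((a - b) ** 2 for a, b in zip(means, target_means))
            + sum((a - b) ** 2 for a, b in zip(variances, target_vars)))


def fit(target_means, target_vars, eps, steps=200, h=1e-5):
    """Finite-difference gradient descent; a step is kept only if it lowers the loss."""
    s = len(target_means)
    theta = [0.0] * (2 * s - 1)
    current = loss(theta, eps, target_means, target_vars)
    step = 1.0
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            bumped = theta[:]
            bumped[i] += h
            grad.append((loss(bumped, eps, target_means, target_vars) - current) / h)
        while step > 1e-12:
            trial = [t - step * g for t, g in zip(theta, grad)]
            value = loss(trial, eps, target_means, target_vars)
            if value < current:
                theta, current = trial, value
                step *= 2.0  # try a bolder step next time
                break
            step *= 0.5
        else:
            break  # no improving step found; stop early
    return theta, current
```

To use it, generate the draws once, for example `eps = [[rng.gauss(0, 1) for _ in range(s)] for _ in range(1000)]` with `rng = random.Random(0)`, and pass them to `fit` together with the given means and variances. Reusing the same draws keeps the loss a deterministic function of the parameters, which is what makes finite differences usable. In practice one would probably weight the variance terms, since they are on a smaller scale than the means.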