[Math] Optimization of a function over probability distributions

convex optimization, nonlinear optimization, optimization, parametrization, probability

I'm trying to solve certain optimization problems dealing with probability distributions. Consider the space of probability distributions on $\{1, \dots, N\}$, i.e., functions $\{1, \dots, N\} \to [0, 1]$ whose values sum to $1$.

I have a function $f : (\{1, \dots, N\} \to [0, 1]) \to \mathbb{R}$ whose minima and maxima I'd like to find.

I have an idea: parametrize the space of distributions by some real vector space like $\mathbb{R}^n$. This would let me find critical points using standard gradient methods. I was able to do this for the $N=2$ case by parametrizing $[0, 1]$ over the real line to determine the probability of $1$ (the probability of $2$ then follows from the probability of $1$), but I failed to figure this out even for $N=3$.
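For concreteness, here's a minimal sketch of that $N=2$ trick in Python (the language and the logistic squashing function are my own choices; any smooth surjection from $\mathbb{R}$ onto $(0,1)$ would do):

```python
import math

def sigmoid(t):
    """Smoothly map the real line onto the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def distribution_n2(t):
    """Parametrize distributions on {1, 2} by a single real t:
    p(1) = sigmoid(t), and p(2) follows by normalization."""
    p1 = sigmoid(t)
    return (p1, 1.0 - p1)

print(distribution_n2(0.0))  # (0.5, 0.5): the uniform distribution
```

Any off-the-shelf unconstrained optimizer can then run over $t$ directly.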

It's also fairly easy to parametrize the space of functions $\{1, \dots, N\} \to \mathbb{R}$ that are additive (i.e., sum to $1$) by treating them like vectors and parametrizing the affine subspace that satisfies the linear equation $\sum_{w\in\{1,\dots,N\}} p(w) = 1$:

$$p(1) = 1 - \sum_{w\in\{2,\dots,N\}} p(w)$$
$$p(2) = a_1$$
$$\vdots$$
$$p(N) = a_{N-1}$$

where the $a_i$ are free. Incorporating the bounds is harder, however. I manually worked out the bounded space for $N=3$: it winds up being the right triangle with both legs of length $1$ and hypotenuse of length $\sqrt{2}$. I also worked it out for the $N=4$ case, where it winds up being a tetrahedron with three faces that are right triangles (just like the one before) and one equilateral face. I have two problems with this, however: 1) I don't know how to parametrize the tetrahedron (I have not tried parametrizing the triangle either, but I assume it isn't hard), and 2) I don't know how these shapes generalize to higher dimensions, so I'm stuck at the $N=4$ case.
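As a side note, the unconstrained affine parametrization above is a one-liner in code; here is a Python sketch (the names are mine), and the catch is exactly the bounds problem just described:

```python
def distribution_affine(a):
    """Parametrize the hyperplane sum(p) = 1 on N points by N - 1 free
    parameters a = (a_1, ..., a_{N-1}).  The bounds are NOT enforced:
    the output is a genuine distribution only when every entry
    happens to land in [0, 1]."""
    return [1.0 - sum(a)] + list(a)

print(distribution_affine([0.2, 0.3]))  # [0.5, 0.2, 0.3] -- valid
print(distribution_affine([0.9, 0.8]))  # [-0.7, 0.9, 0.8] -- out of bounds
```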

Specifically, I am concerned with finding local minima and maxima of the following function (cross entropy): $$p, q \mapsto \sum_{w\in\{1,\dots,N\}} p(w) \cdot \log_2\frac{1}{q(w)}$$
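For reference, the objective is easy to evaluate numerically; a small Python sketch (assuming $q(w) > 0$ wherever $p(w) > 0$):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = sum_w p(w) * log2(1 / q(w)), in bits.
    Assumes q(w) > 0 wherever p(w) > 0."""
    return sum(pw * math.log2(1.0 / qw) for pw, qw in zip(p, q) if pw > 0)

# With q = p this is the Shannon entropy; uniform on 2 points gives 1 bit.
print(cross_entropy([0.5, 0.5], [0.5, 0.5]))  # 1.0
```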

For the $N=2$ case (the only non-trivial case I have solved), it turns out this is maximized at $p(1) = \frac{1}{e}$ and $q(1) = \frac{1}{1+\frac{1}{e-1}}$, which is kinda neat because it doesn't depend on the base of the logarithm.

Does anyone know how to parametrize these probability distributions? Is there a better way to go about finding minima and maxima here?

note: I've tagged convex analysis/optimization because the set of probability distributions is convex.

Best Answer

So, with a bit of help from a number of people, I finally seem to have figured this one out.

First note that we can map the sub-additive subspace of $\prod^N [0,1]$ (points whose coordinates sum to at most $1$) onto the additive subspace of $\prod^{N+1} [0,1]$ (points whose coordinates sum to exactly $1$) by adding another dimension equal to $1$ minus the sum of the other dimensions. I believe this is a special case of barycentric coordinates.
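In code the completion map is a one-liner; a Python sketch (the function name is mine):

```python
def complete(v):
    """Send a sub-additive point (coordinates summing to at most 1) to
    an additive point (coordinates summing to exactly 1) by appending
    1 minus the sum of the existing coordinates."""
    return list(v) + [1.0 - sum(v)]

print(complete([0.2, 0.3]))  # [0.2, 0.3, 0.5] -- sums to exactly 1
```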

Next, take a look at this mapping, where $V_{N+1}$ is assumed to be $0$ for the sake of clarity.

$$(V_1, ..., V_N)^T \mapsto (\prod^N_{i=1} V_i - \prod^{N + 1}_{i=1} V_i, ..., V_1 - V_1 \cdot V_2)^T$$

One can check that this maps $\prod^{N} [0,1]$ into its sub-additive subspace: each component is nonnegative, and the sum telescopes to $V_1 \le 1$.
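Here is a quick numerical check of that claim (a Python sketch; the helper name is mine). Writing $P_k = \prod_{i=1}^{k} V_i$, the components are $P_k - P_{k+1}$ for $k = N, \dots, 1$:

```python
from itertools import accumulate
import operator

def cube_to_subadditive(v):
    """Map [0,1]^N into its sub-additive subspace.  With partial
    products P_k = V_1 * ... * V_k and P_{N+1} = 0 (since V_{N+1} = 0),
    the outputs are P_k - P_{k+1} for k = N, ..., 1; each is
    nonnegative and the sum telescopes to P_1 = V_1 <= 1."""
    p = list(accumulate(v, operator.mul)) + [0.0]  # P_1, ..., P_N, P_{N+1}
    return [p[k] - p[k + 1] for k in range(len(v) - 1, -1, -1)]

out = cube_to_subadditive([0.5, 0.5, 0.5])
print(out, sum(out))  # [0.125, 0.125, 0.25], sum = V_1 = 0.5
```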

Composing the two, we get this nice mapping (still using the fact that $V_{N+1}$ is zero, and additionally using the fact that an empty product is $1$):

$$(V_1, ..., V_N)^T \mapsto (..., \prod^{N-n}_{i=1} V_i - \prod^{N-n+1}_{i=1} V_i, ...)^T$$ where $n$ goes from $0$ to $N$ inclusive.

It can again be checked that this maps $\prod^{N} [0,1]$ to the additive subspace of $\prod^{N+1} [0,1]$: the sum of the components now telescopes to exactly $1$.
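And the corresponding check for the composed map (again a Python sketch; with $P_0 = 1$ for the empty product, the sum telescopes to $P_0 - P_{N+1} = 1$):

```python
from itertools import accumulate
import operator

def cube_to_simplex(v):
    """Map [0,1]^N to the additive subspace of [0,1]^(N+1): the n-th
    output (n = 0, ..., N) is P_{N-n} - P_{N-n+1}, with the empty
    product P_0 = 1 and P_{N+1} = 0, so the outputs sum to 1."""
    p = [1.0] + list(accumulate(v, operator.mul)) + [0.0]  # P_0 .. P_{N+1}
    top = len(v) + 1
    return [p[top - 1 - n] - p[top - n] for n in range(top)]

out = cube_to_simplex([0.5, 0.5, 0.5])
print(out, sum(out))  # [0.125, 0.125, 0.25, 0.5], sum = 1.0
```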

This seems like a nice (time will tell) analytic surjection from cubes to probability distributions. Additionally, since we can map Euclidean space surjectively onto the unit cube, we can compose these maps to get a mapping from $N$-dimensional Euclidean space to the probability distributions on $N+1$ points. This lets us take the gradient and solve the resulting system of equations, which turns out to be remarkably solvable. I've also had good luck using this kind of method for some other basic stuff.
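To illustrate the full pipeline, here is a sketch (my assumptions: Python, a logistic squashing of $\mathbb{R}^N$ onto $(0,1)^N$, and a finite-difference gradient standing in for the analytic one; the answer above solves the critical-point system exactly, while this just runs plain gradient descent):

```python
import math
import operator
from itertools import accumulate

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def reals_to_simplex(x):
    """R^N -> distributions on N + 1 points: squash each coordinate
    into (0, 1), then apply the telescoping cube-to-simplex map."""
    v = [sigmoid(t) for t in x]
    p = [1.0] + list(accumulate(v, operator.mul)) + [0.0]  # P_0 .. P_{N+1}
    top = len(v) + 1
    return [p[top - 1 - n] - p[top - n] for n in range(top)]

def cross_entropy(p, q):
    return sum(pw * math.log2(1.0 / qw) for pw, qw in zip(p, q) if pw > 0)

def grad(f, x, h=1e-6):
    """Central finite-difference gradient (a stand-in for autodiff)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

# Unconstrained gradient descent on H(p, q) over both parameter vectors,
# N = 2 (distributions on 3 points).  Descent should drive H toward 0,
# with p and q collapsing toward a common point mass.
f = lambda z: cross_entropy(reals_to_simplex(z[:2]), reals_to_simplex(z[2:]))
z = [0.3, -0.2, -0.1, 0.4]
for _ in range(2000):
    z = [zi - 0.05 * gi for zi, gi in zip(z, grad(f, z))]
print(f(z), reals_to_simplex(z[:2]), reals_to_simplex(z[2:]))
```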
