Mutual information is maximized when Poisson-Binomial reduces to Binomial

binomial distribution, mutual information, optimization, probability, random variables

I have two random variables $X$ and $Y$. $X$ follows a Poisson-Binomial distribution with parameters $\{q_1, \ldots, q_k\}$. Thus, $X$ can take values in the set $\{0,1,\ldots,k\}$.

$Y$ is a binary random variable. $Y=0$ with probability $\frac{0.1}{1+x}$ when $X=x$. Thus, the transition probability $P(Y=0 \mid X=x) = \frac{0.1}{1+x}$ is decreasing in $x$.

I am interested in the mutual information between $X$ and $Y$ denoted by $I(X;Y)$.

Specifically, I have a conjecture that the mutual information $I(X;Y)$
is maximum when all $q_i$'s are equal which leads to a Binomial
distribution for $X$.

Also, my intuition is that any transition probability that is decreasing in $x$ (similar to $\frac{0.1}{1+x}$) leads to the same conclusion.

Can someone help me prove this, or provide a counterexample?
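Before attempting a proof, the conjecture is easy to probe numerically. Below is a minimal Python sketch (the helper names `poisson_binomial_pmf` and `mutual_information` are mine, not from any library): it builds the Poisson-Binomial pmf by convolution, applies the channel $P(Y=0 \mid X=x) = \frac{0.1}{1+x}$, and compares $I(X;Y)$ at the equal-$q$ point against random $q$'s with the same sum. If the conjecture holds, the equal-$q$ value should dominate every random-$q$ value.

```python
import numpy as np

def poisson_binomial_pmf(q):
    """P(X = x) for X = sum of independent Bernoulli(q_i), built by convolution."""
    pmf = np.array([1.0])
    for qi in q:
        pmf = np.convolve(pmf, [1.0 - qi, qi])
    return pmf  # index x runs over 0, ..., k

def mutual_information(q):
    """I(X;Y) in nats for the channel P(Y=0 | X=x) = 0.1 / (1 + x)."""
    px = poisson_binomial_pmf(q)
    c0 = 0.1 / (1.0 + np.arange(len(px)))      # P(Y=0 | X=x)
    joint = np.stack([px * c0, px * (1.0 - c0)], axis=1)
    py = joint.sum(axis=0)
    m = joint > 0
    return float(np.sum(joint[m] * np.log(joint[m] / (px[:, None] * py)[m])))

rng = np.random.default_rng(0)
k, total = 5, 2.0
print("equal q: ", mutual_information(np.full(k, total / k)))
for _ in range(5):
    q = rng.dirichlet(np.ones(k)) * total      # random q's with the same sum
    if np.all(q < 1):                          # keep each q_i a valid probability
        print("random q:", mutual_information(q))
```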

$\underline{\text{Some useful facts that I noted}}$:

  1. Among Poisson-Binomial distributions with a given $\sum q_i$ and $k$, the Binomial distribution has the maximum variance (see the derivation after this list).
  2. The entropy of a Poisson-Binomial distribution is bounded above by the entropy of a Binomial distribution with the same $k$ and $\sum q_i$.
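For what it's worth, fact 1 has a one-line justification:
$$ \operatorname{Var}(X) \;=\; \sum_{i=1}^k q_i(1-q_i) \;=\; \sum_{i=1}^k q_i \;-\; \sum_{i=1}^k q_i^2, $$
and for a fixed value of $\sum_i q_i$, the sum $\sum_i q_i^2$ is minimized exactly when all $q_i$ are equal (by convexity of $t \mapsto t^2$, or Cauchy–Schwarz), so the variance is maximized at the Binomial point.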

I have asked this question on cstheory.SE as well to gather attention, as Information Theory topics have a comparatively small audience here.

Best Answer

I believe your conjecture is correct! But this relies on the result of Shepp and Olkin referenced on the wiki for the Poisson-Binomial distribution ("Entropy of the sum of independent Bernoulli random variables and of the multinomial distribution", 1981), which states that the entropy of the Poisson-Binomial distribution is a concave function of the parameters on $[0,1]^n$. I can't access that proof right now, so I'll just assume it holds and move forward.

If we accept concavity, then it suffices to locate a critical point of $I(X;Y)$, as any critical point of a concave function is a global maximum.


Restricting to the case of a fixed mean $\mu := \sum_{i=1}^n p_i$ for $X$, we want to show that the point $\mathbf{p}^* = (p_1, \dots, p_n) = (\mu/n, \dots, \mu/n)$ is a critical point of the function $$ I(X;Y) = \sum_{k=0}^n \sum_{\ell=0}^1 P(X=k, Y=\ell) \log\left({P(X=k, Y=\ell) \over P(X=k) P(Y=\ell)} \right). $$ Writing $\mathbf{p} = (p_1, \dots, p_n)$ and $\mathbb{1} = (1, \dots, 1)$, we are working only with points $\mathbf{p}$ that lie in the plane $M = \{ \mathbf{x} : \mathbf{x} \cdot \mathbb{1} = \mu \}$.

We can simplify the joint distribution, writing $P(X=k, Y=\ell) = P(Y=\ell | X=k) P(X=k)$, so that $$ I(X;Y) = \sum_{k=0}^n P(X=k) \sum_{\ell=0}^1 P(Y=\ell | X=k) \log\left({P(Y=\ell | X=k) \over P(Y=\ell)} \right). $$

Now $P(Y=\ell|X=k)$ is independent of the $p_i$, so directional derivatives of $I(X;Y)$ w.r.t. the $p_i$ are determined by directional derivatives of the terms $P(X=k)$ and $P(Y=\ell)$. I compute these along the directions $e_{i,j}$, which are $+1$ in the $p_i$-direction and $-1$ in the $p_j$-direction (and hence stay within $M$). In short,

  • compute the directional derivatives of $P(X=k)$ to show it has a critical point at $\mathbf{p}^*$, i.e. for $\mathbf{v} \in \mathbb{1}^{\perp}$, $$ \partial_{\mathbf{v}}P(X=k) = 0. $$ This can also be done using your symmetry argument from the cstheory discussion: at $\mathbf{p}^* = (\mu/n) \mathbb{1}$, all $\partial_{e_{i,j}} P(X=k)$ must be equal by symmetry, but $e_{i,j} = - e_{j,i}$, so the derivatives come in opposite-sign pairs and therefore all vanish.
  • Note $P(Y=\ell)$ is a linear combination of the $P(X=k)$ (with coefficients $P(Y=\ell|X=k)$ that do not depend on $\mathbf{p}$), so it too has a critical point at $\mathbf{p}^*$.
  • If the directional derivatives $\partial_{\mathbf{v}}P(X=k)$ and $\partial_{\mathbf{v}}P(Y=\ell)$ all vanish, then $\partial_{\mathbf{v}}I(X;Y) = 0$ as well, and we're done! (See the numerical spot-check after this list.)
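To make the bullets above concrete, here is a quick finite-difference spot-check (a sketch only, with illustrative values $n=6$, $\mu=2.4$; the helper functions are the same hypothetical ones as in the question's numerical check) that the directional derivatives of $I(X;Y)$ along the $e_{i,j}$ vanish at $\mathbf{p}^* = (\mu/n)\mathbb{1}$:

```python
import numpy as np

def poisson_binomial_pmf(p):
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.convolve(pmf, [1.0 - pi, pi])
    return pmf

def mutual_information(p):
    px = poisson_binomial_pmf(p)
    c0 = 0.1 / (1.0 + np.arange(len(px)))              # P(Y=0 | X=x)
    joint = np.stack([px * c0, px * (1.0 - c0)], axis=1)
    py = joint.sum(axis=0)
    m = joint > 0
    return float(np.sum(joint[m] * np.log(joint[m] / (px[:, None] * py)[m])))

n, mu, h = 6, 2.4, 1e-6
p_star = np.full(n, mu / n)
for i, j in [(0, 1), (0, 5), (2, 4)]:
    e = np.zeros(n); e[i], e[j] = 1.0, -1.0            # direction e_{i,j}, stays in M
    d = (mutual_information(p_star + h * e) - mutual_information(p_star - h * e)) / (2 * h)
    print(f"derivative along e_({i},{j}): {d:.2e}")    # should be ~0 up to rounding
```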

EDIT:

At this point we know that $I(X;Y)$ has a critical point at ${\mu \over n} \mathbb{1}$, but its concavity has not yet been established. Instead of the Shepp–Olkin result, we use a basic fact about mutual information (mentioned by OP, and with a proof here by Mark Braverman): with $P(Y|X)$ fixed, $I(X;Y)$ is concave in $P(X)$.

In our problem, $P(Y|X)$ is constant with respect to the $\{p_i\}$, so $I(X;Y)$ is concave in the $\{p_i\}$.

This completes the proof, without reference to the structure of $Y$'s distribution.
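If one wants to see the concavity claim in action, a midpoint test along random chords of $M$ is easy to run (again a sketch, with the same hypothetical helpers and illustrative $n=6$, $\mu=2.4$): for a concave function, $I$ at the midpoint of a chord should be at least the average of the endpoint values.

```python
import numpy as np

def poisson_binomial_pmf(p):
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.convolve(pmf, [1.0 - pi, pi])
    return pmf

def mutual_information(p):
    px = poisson_binomial_pmf(p)
    c0 = 0.1 / (1.0 + np.arange(len(px)))              # P(Y=0 | X=x)
    joint = np.stack([px * c0, px * (1.0 - c0)], axis=1)
    py = joint.sum(axis=0)
    m = joint > 0
    return float(np.sum(joint[m] * np.log(joint[m] / (px[:, None] * py)[m])))

rng = np.random.default_rng(1)
n, mu = 6, 2.4
for _ in range(5):
    a = rng.dirichlet(np.ones(n)) * mu                 # two random points of M
    b = rng.dirichlet(np.ones(n)) * mu
    if np.all(a < 1) and np.all(b < 1):
        gap = mutual_information((a + b) / 2) - (
            mutual_information(a) + mutual_information(b)
        ) / 2
        print(f"I(midpoint) - average = {gap:+.5f}")   # nonnegative if I is concave here
```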


Some Notes:

  • I am using $p_i$ rather than the $q_i$ of your problem statement. Sorry!
  • I do not prove uniqueness of the maximum. I believe this should be simple to show, but I'd rather get the main thrust written and double back than worry too much about it.
  • Most interestingly, the specific form of $P(Y=\ell | X=k)$ does not enter into the proof at any point! So the same result should hold for any other random variable $Y$ whose conditional distribution is determined by the value of $X$: $I(X;Y)$ is maximized within $M$ at the point ${\mu \over n} \mathbb{1}$.