Maximal Mutual Information Between Continuous and Discrete Random Variables – Information Theory

Tags: it.information-theory, pr.probability, probability-distributions

Let $X\sim \mathcal{N}(\mu,\sigma^2)$ be a Gaussian random variable with random mean $\mu\sim {\sf Bernoulli}(p)$, i.e., $\mu=1$ with probability $p$ and $\mu=0$ with probability $1-p$. In other words, to draw a realization of $X$, one first flips a coin to decide whether $\mu=0$ or $\mu=1$, and then samples from the Gaussian distribution with the chosen mean.
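For concreteness, here is a minimal sketch of this two-step draw (Python with NumPy; the function name `sample_x` and the default parameter values are illustrative, not part of the question):

```python
import numpy as np

def sample_x(n, p=0.3, sigma=1.0, rng=None):
    """Draw n realizations of X: flip mu ~ Bernoulli(p), then sample X ~ N(mu, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = rng.binomial(1, p, size=n)        # random mean: 1 with probability p, 0 with probability 1 - p
    x = rng.normal(loc=mu, scale=sigma)    # Gaussian draw centered at the chosen mean
    return mu, x

mu, x = sample_x(10_000)
```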

Consider the family $\mathcal{M}$ of Bernoulli random variables taking on the values $0$ or $1$.

Let $\mathcal{I}\left(Y,X\right)$ be the mutual information between $Y$ and $X$ (a natural definition can be found, e.g., in this MathOverflow post).

My intuition tells that a global solution for the optimization problem

$$\max_{Y\in\mathcal{M}} \mathcal{I}(Y,X)$$

is given by $\mu$ itself, i.e.,

$$\mathcal{I}(\mu,X) = \max_{Y\in\mathcal{M}} \mathcal{I}(Y,X).$$

Any reference or ideas on how to prove or disprove this claim would be appreciated.

Best Answer

$\newcommand{\de}{\delta}\newcommand{\M}{\mathcal M} \newcommand{\ep}{\varepsilon} \newcommand{\thh}{\theta}\newcommand\I{\mathcal I}\newcommand{\Si}{\Sigma}$Here is the (slightly edited) definition of the mutual information given in the answer linked in the OP:

Let $D$ be any discrete random variable (r.v.) with distinct values $d_i$ taken with probabilities $p_i=P(D=d_i)>0$ for $i\in I$, where $I$ is a denumerable (that is, at most countable) set. Let $X$ be any r.v. (defined on the same probability space as $D$), with values in any nonempty set $S$ (given also some sigma-algebra $\Sigma$ over $S$, to make $S$ a measurable space). Let $\nu$ be the probability distribution of $X$, so that $\nu(B)=P(X\in B)$ for all $B\in\Sigma$. For each $i\in I$ and each $B\in\Sigma$, let \begin{equation*} \nu_i(B):=P(D=d_i,X\in B). \end{equation*} Then $\nu_i$ is a (sub-probability) measure absolutely continuous with respect to $\nu$, so that we can consider a Radon--Nikodym density \begin{equation*} \rho_i:=\frac{d\nu_i}{d\nu} \end{equation*} of the measure $\nu_i$ with respect to $\nu$, so that the values of $\rho_i$ are in $[0,1]$.

Then the mutual information between $D$ and $X$ is defined as follows: \begin{equation*} \I(D,X):=\sum_{i\in I}\int_S d\nu\;\rho_i\ln\frac{\rho_i}{p_i}. \end{equation*}
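In the setting of the question, $D=\mu$ takes the values $1$ and $0$ with probabilities $p$ and $1-p$, $\nu$ has Lebesgue density $f=p\varphi_1+(1-p)\varphi_0$ (where $\varphi_m$ is the $\mathcal N(m,\sigma^2)$ density), and the Radon--Nikodym densities are $\rho_1=p\varphi_1/f$ and $\rho_0=1-\rho_1$. Here is a sketch of the defining integral by numerical quadrature (Python with NumPy/SciPy; the function name and parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def mutual_info_mu_x(p=0.3, sigma=1.0):
    """I(mu, X) for mu ~ Bernoulli(p), X | mu ~ N(mu, sigma^2), via the densities rho_i = d nu_i / d nu."""
    f1 = lambda x: norm.pdf(x, loc=1.0, scale=sigma)    # density of X given mu = 1
    f0 = lambda x: norm.pdf(x, loc=0.0, scale=sigma)    # density of X given mu = 0
    f = lambda x: p * f1(x) + (1.0 - p) * f0(x)         # mixture density of X (d nu / dx)

    def integrand(x):
        rho1 = p * f1(x) / f(x)                         # rho_1 = d nu_1 / d nu
        rho0 = 1.0 - rho1                               # rho_0 = d nu_0 / d nu
        total = 0.0
        if rho1 > 0.0:                                  # guards implement the 0 * (...) := 0 convention
            total += rho1 * np.log(rho1 / p)
        if rho0 > 0.0:
            total += rho0 * np.log(rho0 / (1.0 - p))
        return f(x) * total

    value, _ = quad(integrand, -12.0 * sigma, 12.0 * sigma + 1.0)
    return value

print(mutual_info_mu_x())   # strictly positive; cf. the maximization below
```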


Let us now answer the current question.

Much more generally than in the OP, let $X$ be any r.v. as in the highlighted text above. Let $\nu$ denote the distribution of $X$.

Let $\M$ denote the set of all Bernoulli r.v.'s $Y$ (defined on the same probability space as $X$) with $P(Y=1)\in(0,1)$; we suppose that the probability space is rich enough so that for any coupling $\gamma$ of any Bernoulli distribution and the distribution $\nu$ of $X$ there is a r.v. $Y$ such that the joint distribution of the pair $(Y,X)$ is $\gamma$.

In view of the highlighted definition, for any $Y\in\M$, \begin{equation*} \I(Y,X)=\int_S d\nu\; \Big(\rho\ln\frac{\rho}{p}+(1-\rho)\ln\frac{1-\rho}{1-p}\Big), \end{equation*} where $p:=P(Y=1)$, $\rho:=\frac{d\nu_1}{d\nu}$, and $\nu_1(B):=P(Y=1,X\in B)$ for all $B\in\Si$; we are assuming the convention that $0\times\text{anything}:=0$.
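To experiment with this formula numerically, one can integrate it for an arbitrary $[0,1]$-valued density $\rho$ (equivalently, an arbitrary coupling of a Bernoulli label with $\nu$). The sketch below (Python with NumPy/SciPy; names and parameters illustrative) uses the Gaussian mixture of the OP for $\nu$; taking $\rho=\rho_1=p\varphi_1/f$ recovers $\I(\mu,X)$, while indicator $\rho$'s give the maximizers found next.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def mi_bernoulli_label(rho, p_mix=0.3, sigma=1.0):
    """I(Y, X) from the displayed formula, for a [0,1]-valued callable rho(x) = d nu_1 / d nu.

    nu is the Gaussian-mixture law of X in the question; p := P(Y = 1) = integral of rho d nu."""
    f = lambda x: p_mix * norm.pdf(x, 1.0, sigma) + (1.0 - p_mix) * norm.pdf(x, 0.0, sigma)
    lo, hi = -12.0 * sigma, 12.0 * sigma + 1.0
    p, _ = quad(lambda x: f(x) * rho(x), lo, hi)        # p = P(Y = 1)

    def integrand(x):
        r = rho(x)
        out = 0.0
        if r > 0.0:                                     # guards implement the 0 * (...) := 0 convention
            out += r * np.log(r / p)
        if r < 1.0:
            out += (1.0 - r) * np.log((1.0 - r) / (1.0 - p))
        return f(x) * out

    value, _ = quad(integrand, lo, hi)
    return value

# Example: a "soft" label whose density rho takes values strictly inside (0, 1).
soft = mi_bernoulli_label(lambda x: 1.0 / (1.0 + np.exp(-3.0 * (x - 0.5))))
```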

We want to maximize $\I(Y,X)$ over all $Y\in\M$. This is easy to do, noting that $f_p(u):=u\ln\frac{u}{p}+(1-u)\ln\frac{1-u}{1-p}$ is convex in $u\in[0,1]$, so that $f_p(u)\le u\,f_p(1)+(1-u)\,f_p(0)$ for all $u\in[0,1]$, with equality only at $u\in\{0,1\}$. Hence, writing $p=\int_S d\nu\,\rho$,
\begin{equation*}
\I(Y,X)=\int_S d\nu\; f_p(\rho)\le p\,f_p(1)+(1-p)\,f_p(0)=p\ln\frac1p+(1-p)\ln\frac1{1-p}=:H(p),
\end{equation*}
the binary entropy of $p$ in nats, with equality if and only if $\rho=1_A$ $\nu$-almost everywhere for some $A\in\Si$, in which case $p=\nu(A)$. Maximizing the latter expression in $p\in(0,1)$, we see that, contrary to the conjecture in the OP,
\begin{equation}
\max_{Y\in\M}\I(Y,X)=H(1/2)=\ln2,
\end{equation}
attained at $Y=1_{X\in A}$ for any $A\in\Si$ with $\nu(A)=1/2$; such a set exists for any r.v. $X$ whose distribution $\nu$ is nonatomic, as is the case for the Gaussian mixture in the OP. In particular, the maximizer is in general not $\mu$: since the two Gaussian components have everywhere positive densities, the density $\rho_1=\frac{d\nu_1}{d\nu}$ corresponding to $D=\mu$ takes values strictly inside $(0,1)$, so $\I(\mu,X)<\ln2$.
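As a quick sanity check of this value in the setting of the question (Python with NumPy/SciPy; parameter values illustrative): for $Y=1_{X>t}$ the density $\rho$ is the indicator of $(t,\infty)$, so the integral above collapses to $H(P(X>t))$, and sweeping the threshold $t$ drives this up to $\ln 2$ as $P(X>t)$ passes through $1/2$.

```python
import numpy as np
from scipy.stats import norm

def binary_entropy(q):
    """H(q) = -q ln q - (1 - q) ln(1 - q), in nats."""
    return -q * np.log(q) - (1.0 - q) * np.log(1.0 - q)

def mi_threshold_label(t, p_mix=0.3, sigma=1.0):
    """I(Y, X) for Y = 1{X > t}: here rho = 1_{(t, inf)}, so I(Y, X) = H(P(X > t))."""
    p_a = p_mix * norm.sf(t, loc=1.0, scale=sigma) + (1.0 - p_mix) * norm.sf(t, loc=0.0, scale=sigma)
    return binary_entropy(p_a)

ts = np.linspace(-5.0, 6.0, 2001)
best = max(mi_threshold_label(t) for t in ts)
print(best, np.log(2.0))   # the scan approaches ln 2 ~ 0.6931, attained as P(X > t) crosses 1/2
```

Any other set $A$ with $\nu(A)=1/2$ does equally well; the threshold form is just the most convenient one here.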
