Maximize Mutual Information between multiple (“overlapping”) RVs

combinatorics · entropy · mutual-information

I'd like to maximize the sum of the Mutual Information between an RV $X$ and each of $K$ out of $N$ possible RVs $Z_i$, i.e., over subsets $S \subseteq \{1, \dots, N\}$ with $|S| = K$:
$$ \max_{\substack{S \subseteq \{1, \dots, N\} \\ |S| = K}} \sum_{i \in S} \text{MI}(X, Z_i) $$
However, when I unfold the sum I get
$$ \sum_{i \in S} \text{MI}(X, Z_i) = \sum_{i \in S} \bigl( \text{H}(X) - \text{H}(X \mid Z_i) \bigr) = K \cdot \text{H}(X) - \sum_{i \in S} \text{H}(X \mid Z_i), $$
so maximizing the sum is equivalent to
$$ \min_{S} \sum_{i \in S} \text{H}(X \mid Z_i) $$
Because the RVs might "overlap", I have to remove their Mutual Information again; e.g., for $K = 2$ I have:
$$ \min_{i, j} \Bigl[ \text{H}(X \mid Z_i) + \text{H}(X \mid Z_j) - \text{MI}(Z_i, Z_j) \Bigr] $$
or, more generally:
$$ \min_{S} \Biggl[ \sum_{i \in S} \text{H}(X \mid Z_i) - \sum_{\substack{i, j \in S \\ i < j}} \text{MI}(Z_i, Z_j) \Biggr] $$
It's obvious that for large $K$ this expression explodes in the number of possible combinations, $\binom{N}{K}$. Is there any helpful (information-theoretic) optimization strategy to solve this problem? I'd also be happy about research literature that tackles it.
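For concreteness, here is a minimal brute-force sketch of the surrogate objective above, assuming the quantities $\text{H}(X \mid Z_i)$ and $\text{MI}(Z_i, Z_j)$ have already been estimated; all names and numbers are illustrative placeholders, not from an actual problem:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
N, K = 10, 3
h_cond = rng.uniform(0.5, 2.0, size=N)           # placeholder H(X | Z_i)
mi = np.abs(rng.normal(0.0, 0.2, size=(N, N)))   # placeholder MI(Z_i, Z_j)
mi = (mi + mi.T) / 2                             # MI is symmetric

def objective(subset):
    """Surrogate from above: sum_i H(X|Z_i) - sum_{i<j} MI(Z_i, Z_j)."""
    total = sum(h_cond[i] for i in subset)
    total -= sum(mi[i, j] for i, j in combinations(subset, 2))
    return total

# Exhaustive search scores all C(N, K) subsets -- the explosion in question.
best = min(combinations(range(N), K), key=objective)
print(best, objective(best))
```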

EDIT: By overlapping I mean that the $Z_i$ can share information, as shown in this publication (note that the variables have different letters).

EDIT 2: In my context, the RVs $Z_i$ are manipulations of an unknown system $X$. I can only make a limited number of manipulations on that system (budget constraints) and want to learn as much as possible about it (i.e., remove as much uncertainty as possible). If I pick my manipulations greedily (i.e., just by their individual MI), I might learn redundant things about the system (that's the "overlapping"). Hence, I want to take this into account when selecting the $K$ manipulations that together remove the most uncertainty.


Best Answer

Your problem can be formulated more rigorously as an Optimal Experimental Design problem. Rather than maximizing a sum of mutual informations, your underlying goal is to maximize the mutual information $I(X;Z)$, where $Z = [Z_1, Z_2, \dots, Z_K]$. You did not specify over what you are optimizing, but I assume it is over some parameters/inputs on which $p(Z_i)$ implicitly depends.

The chain rule for the mutual information yields

$$ I(X;Z) = \sum_{i=1}^K I(X;Z_i \mid Z_{i-1}, \dots, Z_1) $$
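As a sanity check, this identity can be verified numerically on a small discrete joint distribution. Here is a hedged sketch using an arbitrary random pmf over three binary variables, with every term expressed via joint entropies:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))   # arbitrary joint pmf p(x, z1, z2), binary RVs
p /= p.sum()

def H(pmf):
    """Shannon entropy (in bits) of a possibly multi-dimensional pmf."""
    q = pmf[pmf > 0]
    return -np.sum(q * np.log2(q))

# I(X; Z1, Z2) = H(X) + H(Z1, Z2) - H(X, Z1, Z2)
lhs = H(p.sum(axis=(1, 2))) + H(p.sum(axis=0)) - H(p)
# I(X; Z1) = H(X) + H(Z1) - H(X, Z1)
i1 = H(p.sum(axis=(1, 2))) + H(p.sum(axis=(0, 2))) - H(p.sum(axis=2))
# I(X; Z2 | Z1) = H(X, Z1) + H(Z1, Z2) - H(X, Z1, Z2) - H(Z1)
i2 = H(p.sum(axis=2)) + H(p.sum(axis=0)) - H(p) - H(p.sum(axis=(0, 2)))
print(np.isclose(lhs, i1 + i2))   # True: the chain rule holds exactly
```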

Hence, computing $\max_Z I(X;Z)$ can be rewritten as

$$ \max_{Z_1} \Biggl[ I(X;Z_1) + \max_{Z_2} \biggl[ I(X;Z_2 \mid Z_1) + \max_{Z_3} \bigl[ I(X;Z_3 \mid Z_1, Z_2) + \dots \bigr] \biggr] \Biggr] $$

which is, as you explained, an intractable problem, since the algorithmic complexity scales exponentially with $K$. The classical solution is to perform so-called myopic optimization, i.e. to optimize for only one manipulation at a time while taking into account previous manipulations but not future ones:

$$ \sum_{i=1}^K \max_{Z_i} I(X;Z_i \mid Z_{i-1}, \dots, Z_1) $$

which is not as optimal as optimizing $I(X;Z)$ directly, but which makes the problem computationally tractable. You can have a look at our recent preprint, in which we propose a particle-filtering-based solution for mutual information optimization:

Gontier, C., Surace, S. C., & Pfister, J. P. (2022). Efficient Sampling-Based Bayesian Active Learning. arXiv preprint arXiv:2201.07539.
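For illustration, here is a minimal sketch of the myopic/greedy loop above in the discrete setting of your question (choosing $K$ of $N$ candidate manipulations). It assumes access to a hypothetical caller-supplied estimator `cond_mi(i, chosen)` for $I(X;Z_i \mid Z_{\text{chosen}})$; in practice that estimator is the hard part and would itself be approximated, e.g., by sampling:

```python
def greedy_select(candidates, K, cond_mi):
    """Myopic selection: at each of K steps, add the candidate i that
    maximizes cond_mi(i, chosen) ~ I(X; Z_i | Z_chosen). `cond_mi` is a
    hypothetical estimator supplied by the caller, not a library function."""
    chosen = []
    remaining = set(candidates)
    for _ in range(K):
        best = max(remaining, key=lambda i: cond_mi(i, chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Each step is a single one-dimensional search over the remaining candidates, so the cost is $O(K \cdot N)$ estimator calls instead of $\binom{N}{K}$ subset evaluations.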
