Solved – Conditional Mutual Information, Chain Rule

information theory

Consider three discrete binary random variables: A, B, and C.

I have calculated I(A;B) and I(A;C), but I want to calculate I(A;BC).

I'm having trouble implementing this.

From reading this paper I have an idea, but I can't get it to work in Python.

I have these functions (Python 3):

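import numpy as np

def entropy(X, Y):
    # Joint entropy H(X, Y) in bits, estimated from paired samples (NumPy arrays).
    probs = []
    for c1 in set(X):
        for c2 in set(Y):
            probs.append(np.mean(np.logical_and(X == c1, Y == c2)))
    # Use the built-in sum: passing a generator to np.sum is deprecated.
    return sum(-p * np.log2(p) for p in probs if p > 0)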

and

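import numpy as np
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    # Bin the paired samples into a 2-D contingency table, then compute MI (in nats).
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi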

But I'm not sure how to move forward. Thanks in advance.

Best Answer

If you want to compute I(A;BC) you do not necessarily have to use the conditional mutual information. I(A;BC) is the mutual information between A and the joint variable BC.

In Python you can encode BC with four values, 0, 1, 2, 3, one for each combination of values of B and C. For example, use 2*b + c, where b and c are each either 0 or 1. Then you can compute the mutual information between A and BC.
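The encoding described above can be sketched as follows. This is a minimal illustration using only the standard library rather than NumPy or scikit-learn; the function name `mutual_information` and the toy XOR data are assumptions made for the example, not part of the original question.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from two equal-length sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written in terms of counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: A is the XOR of B and C.
B = [0, 0, 1, 1]
C = [0, 1, 0, 1]
A = [b ^ c for b, c in zip(B, C)]

# Encode the pair (B, C) as a single variable with values 0..3.
BC = [2 * b + c for b, c in zip(B, C)]

i_a_b = mutual_information(A, B)
i_a_c = mutual_information(A, C)
i_a_bc = mutual_information(A, BC)
```

In this toy case A is independent of B alone and of C alone, yet fully determined by the pair, which is why I(A;BC) cannot be obtained by simply adding I(A;B) and I(A;C).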

Otherwise, as you point out, you can use the chain rule from the paper: I(A;BC) = I(A;B) + I(A;C|B). In this case you need the conditional distributions given B in order to evaluate I(A;C|B).
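A sketch of the chain-rule route, assuming the conditional term is computed as I(A;C|B) = Σ_b p(b) I(A;C | B=b), that is, by splitting the samples by the value of B. The helper names and the sample data below are hypothetical and only serve to check that both routes agree.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from two equal-length sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X;Y|Z) = sum_z p(z) * I(X;Y | Z=z), in bits."""
    n = len(zs)
    total = 0.0
    for z, count in Counter(zs).items():
        # Restrict the samples to those where Z == z.
        idx = [i for i, v in enumerate(zs) if v == z]
        sub_x = [xs[i] for i in idx]
        sub_y = [ys[i] for i in idx]
        total += (count / n) * mutual_information(sub_x, sub_y)
    return total

# Hypothetical binary samples, for illustration only.
A = [0, 0, 1, 1, 0, 1, 1, 0]
B = [0, 1, 0, 1, 0, 1, 1, 0]
C = [0, 0, 1, 0, 1, 1, 0, 1]
BC = [2 * b + c for b, c in zip(B, C)]

i_a_b = mutual_information(A, B)
i_a_c_given_b = conditional_mutual_information(A, C, B)
i_a_bc = mutual_information(A, BC)
```

On the empirical distribution the two routes coincide up to floating-point error, so `i_a_b + i_a_c_given_b` can be compared against `i_a_bc` as a sanity check.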

Related Solutions

Information Gain and Mutual Information – Key Measures in Information Theory

I think that calling the Kullback-Leibler divergence "information gain" is non-standard.

The first definition is standard.

EDIT: However, $H(Y) - H(Y|X)$ can also be called mutual information.
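For reference, the standard identities linking these quantities (stated here as textbook facts, not taken from the original answer) are:

$$ I(X;Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = D_{\mathrm{KL}}\big(p(x,y) \,\|\, p(x)\,p(y)\big) $$

So mutual information is itself a Kullback-Leibler divergence, namely the one between the joint distribution and the product of the marginals.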

Note that I don't think you will find any scientific discipline that really has a standardized, precise, and consistent naming scheme. So you will always have to look at the formulae, because they will generally give you a better idea.

Textbooks: see "Good introduction into different kinds of entropy".

Also: Cosma Shalizi: Methods and Techniques of Complex Systems Science: An Overview, chapter 1 (pp. 33--114) in Thomas S. Deisboeck and J. Yasha Kresh (eds.), Complex Systems Science in Biomedicine http://arxiv.org/abs/nlin.AO/0307015

Robert M. Gray: Entropy and Information Theory http://ee.stanford.edu/~gray/it.html

David MacKay: Information Theory, Inference, and Learning Algorithms http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

also, "What is “entropy and information gain”?"

Solved – Calculating the mutual information between two histograms

According to Wikipedia, the mutual information of two random variables may be calculated using the following formula: $$ I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log{ \left(\frac{p(x,y)}{p(x)\,p(y)} \right) } $$

Picking up your code from this point:

[co1, ce1] = hist(randpoints1, bins); 
[co2, ce2] = hist(randpoints2, bins);

We can solve this the following way:

% calculate each marginal pmf from the histogram bin counts
p1 = co1/sum(co1);
p2 = co2/sum(co2);

% calculate joint pmf assuming independence of variables
p12_indep = bsxfun(@times, p1.', p2);

% sample the joint pmf directly using hist3
p12_joint = hist3([randpoints1', randpoints2'], [bins, bins])/points;

% using the wikipedia formula for mutual information
dI12 = p12_joint.*log(p12_joint./p12_indep); % contribution of each bin (in nats)
I12 = nansum(dI12(:)); % sum over all bins; nansum skips the NaNs from empty bins (0*log(0))
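The same computation can be sketched in Python from a table of joint bin counts. This is an illustrative version using only the standard library; the function name `mutual_information_from_counts` and the two small count tables are assumptions. Empty bins are skipped explicitly, which plays the role of `nansum` above.

```python
from math import log

def mutual_information_from_counts(counts):
    """Mutual information (in nats) from a 2-D table of joint bin counts."""
    total = sum(sum(row) for row in counts)
    # Marginal pmfs: row sums and column sums of the joint table.
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(col) / total for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c > 0:  # empty bins contribute nothing (0 * log 0 is taken as 0)
                p = c / total
                mi += p * log(p / (row_p[i] * col_p[j]))
    return mi

# Hypothetical count tables: one consistent with independence, one fully dependent.
independent_counts = [[10, 10], [10, 10]]
dependent_counts = [[20, 0], [0, 20]]

mi_independent = mutual_information_from_counts(independent_counts)
mi_dependent = mutual_information_from_counts(dependent_counts)
```

Note that this version derives the marginals from the joint table itself, whereas the MATLAB code takes them from separate one-dimensional histograms; the two agree when the bin edges match.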

I12 for the random variables that you generate is quite low (~0.01), which is not surprising, since you generate them independently. Plotting the distribution that assumes independence side by side with the sampled joint distribution shows how similar they are.

If, on the other hand, we introduce dependence by generating randpoints2 to have some component of randpoints1, like this for example:

randpoints2 = 0.5*(sigma2.*randn(1, points) + mu2 + randpoints1);

I12 becomes much larger (~0.25), reflecting the greater mutual information that these variables now share. Plotting the above distributions again shows a clear difference (it would be clearer still with more points and bins) between the joint pmf that assumes independence and the pmf generated by sampling the variables simultaneously.

The code I used to plot the two distributions:

figure;
subplot(121); pcolor(p12_indep); axis square;
xlabel('Var2'); ylabel('Var1'); title('Independent: P(Var1)*P(Var2)');
subplot(122); pcolor(p12_joint); axis square;
xlabel('Var2'); ylabel('Var1'); title('Joint: P(Var1,Var2)');
