I am a bit confused. Can someone explain how to calculate the mutual information between two terms, given a term-document matrix with binary term-occurrence weights?
$$
\begin{matrix}
& 'Why' & 'How' & 'When' & 'Where' \\
Document1 & 1 & 1 & 1 & 1 \\
Document2 & 1 & 0 & 1 & 0 \\
Document3 & 1 & 1 & 1 & 0
\end{matrix}
$$
$$I(X;Y)= \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\left(\frac{p(x,y)}{p(x)p(y)} \right)$$
Thank you
Best Answer
How about forming a joint probability table by counting co-occurrences of the two terms across documents and normalizing by the number of documents? From that table you can obtain the joint entropy and the marginal entropies, and finally $$I(X;Y) = H(X)+H(Y)-H(X,Y). $$
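A minimal sketch of this recipe in Python/NumPy, using the matrix from the question (I am assuming the columns correspond to 'Why', 'How', 'When', 'Where' in that order; picking 'How' and 'Where' is just for illustration):

```python
import numpy as np

# Binary term-document matrix from the question (rows = documents,
# columns assumed to be the terms 'Why', 'How', 'When', 'Where').
M = np.array([
    [1, 1, 1, 1],  # Document1
    [1, 0, 1, 0],  # Document2
    [1, 1, 1, 0],  # Document3
])

def mutual_information(x, y):
    """MI (in bits) between two binary term columns, computed as
    I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    # 2x2 joint probability table: normalized co-occurrence counts
    joint = np.zeros((2, 2))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= len(x)

    def entropy(p):
        p = p[p > 0]                   # 0 * log 0 = 0 by convention
        return -np.sum(p * np.log2(p))

    h_x = entropy(joint.sum(axis=1))   # marginal entropy H(X)
    h_y = entropy(joint.sum(axis=0))   # marginal entropy H(Y)
    h_xy = entropy(joint.ravel())      # joint entropy H(X,Y)
    return h_x + h_y - h_xy

# MI between 'How' (column 1) and 'Where' (column 3)
print(round(mutual_information(M[:, 1], M[:, 3]), 4))  # → 0.2516
```

You can sanity-check the result against the double-sum definition from the question: the only nonzero joint cells here are $(1,1)$, $(1,0)$ and $(0,0)$, each with $p(x,y)=1/3$, and summing $p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}$ over them gives the same $\approx 0.2516$ bits.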