Solved – How to calculate mutual information between a feature and target variable

descriptive statistics, mutual information

Mutual information measures how much information the distribution of one variable provides about the distribution of another variable.

In my case, I have samples of a feature variable $X \in \mathbb{R}$ and a target variable $Y \in \mathbb{R}$. This is different from the usual pairing of categorical variables (i.e. class labels) $(Y_{true}, Y_{predicted})$ you can find in statistical packages.

My current approach is:

  1. Standardize $X$ and $Y$ to have $0$ mean and unit variance so they are more or less on the same scale.
  2. Make a contingency table from the variables: create a discrete grid of bins and count how many sample pairs $(x, y)$ fall into each cell, like a 2D histogram.
  3. Normalize the table so it all sums to $1$. This would approximate the joint distribution $P(X,Y)$.
  4. Use this to compute the mutual information (a rough code sketch of these steps is below).
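
Putting the steps together, here is a minimal sketch of what I have in mind (using NumPy; the toy data and the bin count of 20 are arbitrary placeholders):

```python
import numpy as np

# Paired samples of X and Y (placeholder data; replace with your own)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)

# 1. Standardize to zero mean and unit variance
x = (x - x.mean()) / x.std()
y = (y - y.mean()) / y.std()

# 2. Contingency table / 2D histogram (bin count chosen arbitrarily)
counts, _, _ = np.histogram2d(x, y, bins=20)

# 3. Normalize so it sums to 1, approximating the joint P(X, Y)
p_xy = counts / counts.sum()

# 4. Plug-in estimate of mutual information (in nats)
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X (column vector)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y (row vector)
nonzero = p_xy > 0                       # skip empty cells (0 * log 0 := 0)
mi = np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero]))
print(mi)
```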

Is this approach theoretically sound? If so, how should the bin size be chosen? And is standardizing $X$ and $Y$ necessary in the first place?

Best Answer

The mutual information between two continuous random variables $X$ and $Y$ is defined as the following double integral over the domains $\mathcal{X}$ and $\mathcal{Y}$: $$ \operatorname{I}(X;Y)=\int_{\mathcal{Y}}\int_{\mathcal{X}} p(x,y)\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)\,dx\,dy. $$

The most straightforward way to estimate the mutual information is binning: it approximates the integral by converting the continuous variables into discrete ones, to which the approach you outlined above applies.
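
Concretely, if the samples are binned into a normalized 2D histogram with cells indexed by $(i, j)$ and cell probabilities $\hat p_{ij}$, the integral is replaced by the discrete plug-in sum $$ \hat{I}(X;Y)=\sum_{i}\sum_{j}\hat p_{ij}\log\left(\frac{\hat p_{ij}}{\hat p_{i\cdot}\,\hat p_{\cdot j}}\right), $$ where $\hat p_{i\cdot}=\sum_{j}\hat p_{ij}$ and $\hat p_{\cdot j}=\sum_{i}\hat p_{ij}$ are the marginal (row and column) sums.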

Alternatively, k-nearest neighbor distances can be used to estimate the Shannon entropy terms. This is, for instance, the approach used in scikit-learn, proposed in A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Phys. Rev. E 69, 066138 (2004).
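
As a minimal illustration, assuming scikit-learn is installed, the kNN-based estimator can be invoked via `mutual_info_regression` (the toy data below is just a placeholder):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Paired continuous samples (placeholder data; replace with your own)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)

# scikit-learn expects a 2D feature matrix; n_neighbors controls the kNN estimate
mi = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3, random_state=0)
print(mi[0])  # estimated mutual information in nats
```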