Solved – Feature selection using mutual information in Matlab

feature selectioninformation theoryMATLABmutual information

I am trying to apply the idea of mutual information to feature selection, as described in these lecture notes (on page 5).

My platform is Matlab. One problem I find when computing mutual information from empirical data is that the number is always biased upwards. I found about 3~4 different files to calculate MI on Matlab Central and they all give big numbers (like > 0.4) when I feed in independent random variables.

I am not an expert, but the problem seems to be that if you simply use joint and marginal densities to compute MI, bias is introduced in the process because MI is by definition positive. Does anyone have practical advice on how to estimate mutual information accurately?

A related question is, in practice, how do people actually use MI to select features? It is not obvious to me how to come up with a threshold value since MI is in theory unbounded. Or do people just rank the features by MI and take the top k features?

Best Answer

This is the problem of limited sampling bias.

The small sample estimates of the densities are noisy, and this variation induces spurious correlations between the variables which increase the estimated information value.

In the discrete case this is a well studied problem. There are many techniques to correct, from the fully Bayesian (NSB), to simple corrections. The most basic (Miller-Madow) is to subtract $(R-1)(S-1) / 2N\ln2$ from the value. This is the difference in degrees of freedom between the two implicit models (full joint multinomial vs the product of independent marginals) - indeed with sufficient sampling $2N\ln(2)I$ is the likeilhood ratio test of indepenence (G-test) which is $\chi^2$ distributed with $(R-1)(S-1)$ d.o.f. under the null hypothesis. With limited trials it can even be hard to estimate R and S reliably - an effective correction is to use a Bayesian counting procedure to estimate these (Panzeri-Treves or PT correction).

Some package implementing these techniques in Matlab include infotoolbox and Spike Train Analysis Toolkit.

For the continuous case, estimators based on nearest neighbour distances reduuce the problem.