Solved – Feature selection using mutual information in Matlab

feature selectioninformation theoryMATLABmutual information

I am trying to apply the idea of mutual information to feature selection, as described in these lecture notes (on page 5).

My platform is Matlab. One problem I find when computing mutual information from empirical data is that the number is always biased upwards. I found about 3~4 different files to calculate MI on Matlab Central and they all give big numbers (like > 0.4) when I feed in independent random variables.

I am not an expert, but the problem seems to be that if you simply use joint and marginal densities to compute MI, bias is introduced in the process because MI is by definition positive. Does anyone have practical advice on how to estimate mutual information accurately?

A related question is, in practice, how do people actually use MI to select features? It is not obvious to me how to come up with a threshold value since MI is in theory unbounded. Or do people just rank the features by MI and take the top k features?

Best Answer

This is the problem of limited sampling bias.

The small sample estimates of the densities are noisy, and this variation induces spurious correlations between the variables which increase the estimated information value.

In the discrete case this is a well studied problem. There are many techniques to correct, from the fully Bayesian (NSB), to simple corrections. The most basic (Miller-Madow) is to subtract $(R-1)(S-1) / 2N\ln2$ from the value. This is the difference in degrees of freedom between the two implicit models (full joint multinomial vs the product of independent marginals) - indeed with sufficient sampling $2N\ln(2)I$ is the likeilhood ratio test of indepenence (G-test) which is $\chi^2$ distributed with $(R-1)(S-1)$ d.o.f. under the null hypothesis. With limited trials it can even be hard to estimate R and S reliably - an effective correction is to use a Bayesian counting procedure to estimate these (Panzeri-Treves or PT correction).

Some package implementing these techniques in Matlab include infotoolbox and Spike Train Analysis Toolkit.

For the continuous case, estimators based on nearest neighbour distances reduuce the problem.

Related Solutions

Solved – Mutual Information based feature selection

You are as such correct, but I would suggest using Weka to do it for you. For example, the following piece of java code will help you choose the attributes by mutual information using Weka.

    Instances trainingInstances; // feature matrix
    int n = ...; // number of features to select
    AttributeSelection attributeSelection = new AttributeSelection();
    InfoGainAttributeEval infoGainAttributeEval = new InfoGainAttributeEval();
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(n);
    attributeSelection.setEvaluator(infoGainAttributeEval);
    attributeSelection.setSearch(ranker);
    attributeSelection.setInputFormat(trainingInstances);
    Instances featureSelected = Filter.useFilter(trainingInstances, attributeSelection);

I am assuming here that you have familiarity using Weka library.

Feature Selection – Information Gain vs. Mutual Information for Feature Selection

They are identically the same.

If $X$ is a nominal feature with $k$ different values and $C$ is your target class with $m$ classes.

\begin{align} \mbox{MI} &= \sum_{i=1}^k\sum_{j=1}^m P(x_i,c_j)\log \frac{P(x_i,c_j)}{P(x_i)P(c_j)}\\ &=-\sum_{i=1}^k\sum_{j=1}^m P(x_i,c_j)\log P(c_j) + \sum_{i=1}^k\sum_{j=1}^m P(c_j|x_i)P(x_i)\log P(c_j|x_i)\\ &= -\sum_{j=1}^m P(c_j)\log P(c_j) + \sum_{i=1}^kP(x_i)\sum_{j=1}^m P(c_j|x_i)\log P(c_j|x_i)\\ &= H(C) - H(C|X) = \mbox{IG} \end{align}

This is a case of inconsistent naming. You might want to have a look at this question too.

Best Answer

Related Solutions

Solved – Mutual Information based feature selection

Feature Selection – Information Gain vs. Mutual Information for Feature Selection

Related Question