Solved – Feature selection: how to select the Information Gain threshold

entropy, feature-selection, machine-learning, svm, text-mining

I am trying to use Information Gain to select features when classifying text with a Support Vector Machine.

For each word in our training data, we compute its information gain (IG). Then we keep only the words whose IG is above some threshold.
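
For reference, the per-term score used here is IG(t) = H(C) − P(t)·H(C|t) − P(¬t)·H(C|¬t), where C is the class variable. Below is a minimal NumPy sketch of that computation, assuming a binary term-presence matrix; the function names and the toy data are illustrative, not taken from the papers cited below:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a vector of counts, skipping empty bins."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(X, y):
    """Per-term IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t).

    X: (n_docs, n_terms) binary term-presence matrix.
    y: (n_docs,) class labels.
    """
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    h_c = entropy(np.array([(y == c).sum() for c in classes], dtype=float))

    ig = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        present = X[:, j] > 0
        h_cond = 0.0
        # Weight the class entropy of each branch (term present / absent).
        for mask in (present, ~present):
            if mask.any():
                counts = np.array([(y[mask] == c).sum() for c in classes],
                                  dtype=float)
                h_cond += mask.mean() * entropy(counts)
        ig[j] = h_c - h_cond
    return ig

# Toy corpus: terms 0 and 1 perfectly predict the class, term 2 is noise.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([0, 0, 1, 1])
print(information_gain(X, y))  # -> [1. 1. 0.]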

When reading the literature on this, I did not find a clear explanation of how this threshold should be selected.

Given a training corpus, for each unique term we computed the information gain, and removed from the feature space those terms whose information gain was less than some predetermined threshold.

Source: A Comparative Study on Feature Selection in Text Categorization


For each dataset we selected the subset of features with non-zero information gain.

Source: Information Gain, Correlation and Support Vector Machine

When we train our SVM with the words having IG > 0, as suggested in this paper, the results are worse than without feature selection.

When using IG > 0.01, the results improve.

Is there a recommended way to select this IG threshold?

Should we perform a grid search (using a training set and cross-validation) to determine the best IG threshold? In that case, the threshold becomes a hyperparameter of the pipeline, just like the C of the SVM model.
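
If we do treat it as a hyperparameter, the search is straightforward to set up. Here is a sketch with scikit-learn, assuming X_train and y_train already hold our document-term matrix and labels; SelectPercentile with mutual_info_classif (scikit-learn's mutual-information estimator, playing the role of IG) stands in for a raw IG cutoff, and the kept percentile is tuned jointly with C:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# X_train, y_train: document-term matrix and labels (assumed to exist).
pipe = Pipeline([
    ("ig", SelectPercentile(score_func=mutual_info_classif)),
    ("svm", LinearSVC()),
])

param_grid = {
    "ig__percentile": [1, 5, 10, 25, 50, 100],  # share of terms kept
    "svm__C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

Putting the selector inside the pipeline matters: the IG scores are then recomputed on each training fold, so the cross-validation estimate is not biased by information leaking from the held-out fold.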

Should we stick to the simple IG > 0 rule even if our generalization results are worse?

Best Answer

I am not sure there is a golden method to calculate the IG threshold.

I assume you have a two-column structure (a table, map, dictionary, etc.) that contains the words in column 1 and their IG scores in column 2. Why not sort it in descending order of IG, take the top 10, 100, 1000, etc., and plot something like this:

[Image: IG score plotted against word count]

This will give you a visually intuitive threshold point for your IG: keep the terms before the elbow, where the curve flattens out.
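
For example, a minimal matplotlib sketch of such a plot, assuming ig_scores is a 1-D array holding the per-term IG values from your setup:

```python
import numpy as np
import matplotlib.pyplot as plt

# ig_scores: 1-D array of per-term IG values (assumed to exist).
ig_sorted = np.sort(ig_scores)[::-1]       # descending order
ranks = np.arange(1, len(ig_sorted) + 1)

plt.plot(ranks, ig_sorted)
plt.xscale("log")   # the few top terms dominate, so a log x-axis helps
plt.xlabel("term rank")
plt.ylabel("information gain")
plt.title("Sorted IG scores")
plt.show()
```

The rank at which the curve visibly flattens is a natural candidate for the cutoff.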
