Solved – How to do feature selection for learning from positive and unlabeled examples

classification, data mining, feature selection, text mining

I have a binary classification task for German webpages for which I only have positive examples. That is why I use learning from positive and unlabeled examples as described on this page, also known as partially supervised learning.

At the moment, I simply exclude very rare features (those that occur only once) and very frequent ones, i.e., stop words and features that occur in more than 50% of all positive examples.
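In case it helps to see it concretely, here is a minimal sketch of that kind of frequency-based pruning using scikit-learn's CountVectorizer; the toy corpus, the stop-word list, and the exact thresholds are placeholders, not my actual setup:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus and stop-word list; in practice these would be the
# German webpages and a full German stop-word list.
documents = [
    "webseite über maschinelles lernen",
    "webseite über textklassifikation",
    "maschinelles lernen für textklassifikation",
    "einführung in die statistik",
    "statistik für anfänger",
    "einführung für anfänger",
]
german_stop_words = ["über", "für", "in", "die", "und"]

vectorizer = CountVectorizer(
    binary=True,                   # boolean features, as planned for Naive Bayes
    stop_words=german_stop_words,  # drop very frequent function words
    min_df=2,                      # drop features occurring in only one document
    max_df=0.5,                    # drop features in more than 50% of documents
)

X = vectorizer.fit_transform(documents)
print(X.shape)
print(vectorizer.get_feature_names_out())
```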

As for classifiers, I want to test Naive Bayes (the positive example webpages are quite short, so I favor the multivariate Bernoulli version with boolean features) and Support Vector Machines (SVM). I've read that feature selection is not so important for SVM because it does not affect the classification results very much. Is that true?
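To make the comparison concrete, here is a hedged sketch of evaluating both classifiers with cross-validation. It assumes a provisional training set has already been built by treating sampled unlabeled pages as negatives (a common first step in PU learning), and it uses random stand-in data in place of the real boolean document-term matrix:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Stand-in data: a boolean document-term matrix and provisional labels
# (1 = known positive, 0 = unlabeled page treated as negative).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 300))
y = rng.integers(0, 2, size=200)

for name, clf in [("Bernoulli Naive Bayes", BernoulliNB()),
                  ("Linear SVM", LinearSVC())]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```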

As I'm not very familiar with feature selection algorithms, can you recommend one that works especially well with features from positive examples only and that generally yields better results than simply cutting off very rare and very frequent features? If it is not possible to give a general answer because it depends heavily on the data set, please say so as well. Thanks a lot!

Best Answer

My experience with SVM has shown it to be fairly robust to uninformative features. Other classifiers, like Naïve Bayes, tend to be more sensitive to such features, which makes feature selection a relatively more important part of their classification workflow. In terms of feature selection algorithms, I'm a fan of information theory-based metrics such as mutual information. When I use this, I usually plot the distribution of each input feature's mutual information during cross-validation and then manually select a cut-off.
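For what it's worth, a minimal sketch of that workflow in scikit-learn might look like the following; the stand-in data, the fold count, and the cut-off of 0.01 are all placeholders, and the plotting step is left as a comment:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import StratifiedKFold

# Stand-in boolean term matrix and provisional labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 300))
y = rng.integers(0, 2, size=200)

# Estimate each feature's mutual information with the label on every
# training fold, then average across folds.
mi_per_fold = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in cv.split(X, y):
    mi = mutual_info_classif(X[train_idx], y[train_idx], discrete_features=True)
    mi_per_fold.append(mi)
mean_mi = np.mean(mi_per_fold, axis=0)

# Inspect the distribution (e.g. a histogram of mean_mi) and pick a cut-off
# by eye; the value below is purely illustrative.
cutoff = 0.01
selected = np.flatnonzero(mean_mi > cutoff)
print(f"Kept {selected.size} of {mean_mi.size} features")
```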
