Solved – How to improve feature selection for the Naive Bayes Classifier

classification, naive bayes, natural language, text mining

I am classifying companies into two classes (a particular business type, or not that business type) using a Naive Bayes Classifier. Specifically, I'm using PHP and PHP NLP Tools.

I have two reasonably accurate lists of companies to use as input for the training and test sets (the business type, and not the business type). The accuracy of these two lists has been validated internally and by the customer we are doing the work for.

I generate a document for each company using various features from a HUGE database of potential features describing these businesses (industries they work in, certifications they have, products and services they sell, etc.).

I am mostly guessing at which features to throw at the system, and not seeing a lot of improvement when I make changes. I've read a lot about programmatically (statistically) selecting features that are (in layman's terms) very significant to my "business type" companies, but not important to the non-business-type companies.

I'm curious what methods exist for this type of feature selection/optimization. For instance … the entire set has roughly 25k different products and services. I recognize the issues with doing this, and can see how selecting features that are more 'relevant' to my good set would be helpful.

The simplest methods I can try first would be the most helpful.

Best Answer

Instead of Naive Bayes, have you considered another classifier, for instance a C4.5 decision tree? It has a built-in feature selection criterion (information gain, which C4.5 normalizes into a gain ratio), and looking at the inner node splits lets you see which features contribute to which class. However, I'm not sure decision trees remain interpretable with 25k features. What about classic approaches like PCA (principal component analysis) or factor analysis? What about forward/backward feature elimination? Give them a try. Here is a comparison of several feature selection techniques:

http://www.researchgate.net/profile/Marcin_Blachnik/publication/226749677_Comparison_of_Various_Feature_Selection_Methods_in_Application_to_Prototype_Best_Rules/links/0046351d6cb319daed000000.pdf
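
If you want to try the information gain idea on its own, without training a tree, a rough sketch could look like the following. It is written in plain Python rather than PHP, and it assumes each company document is reduced to a set of feature names and each label is one of two class strings. The feature names, the labels and the helper names (`entropy`, `information_gain`, `rank_features`) are made up for illustration and do not come from PHP NLP Tools or any other library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, feature):
    """Drop in label entropy once we know whether `feature` is present."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    total = len(labels)
    conditional = ((len(present) / total) * entropy(present)
                   + (len(absent) / total) * entropy(absent))
    return entropy(labels) - conditional


def rank_features(docs, labels, top_k=None):
    """Return (score, feature) pairs, highest information gain first."""
    vocabulary = set().union(*docs)
    scored = [(information_gain(docs, labels, f), f) for f in vocabulary]
    # Sort by descending score, then by name so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Hypothetical toy data: each company is a set of feature names.
docs = [
    {"welding", "iso9001", "steel"},
    {"welding", "steel", "consulting"},
    {"consulting", "software"},
    {"software", "iso9001"},
]
labels = ["target", "target", "other", "other"]

for score, feature in rank_features(docs, labels, top_k=3):
    print(f"{feature}: {score:.3f}")
```

Features that appear about equally often in both classes score close to zero, so keeping only the top few hundred by this score is one simple way to shrink a 25k-feature vocabulary before handing the documents to Naive Bayes. The loop rescans every document for every feature, which is fine for a first experiment but would need precomputed counts for a large dataset.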