Solved – Text classification based on keywords

classification, machine-learning, naive-bayes, text-mining

Totally new to ML so please bear with me. I'm trying to classify text from certain email messages and RSS feed entries. The texts should be classified as either relevant or irrelevant. The decision whether some text is relevant or not should be based on whether it contains certain keywords. I already have a set of these keywords labeled as "relevant" and "irrelevant".

Now my lack of ML knowledge makes me think this should be a simple comparison of keywords found in text. For example, if the text contains the word "blue", which is labeled as "relevant", the text itself is relevant. If it contains "red" (labeled as "irrelevant") it should be classed as irrelevant. Naturally, multiple occurrences should be compared and the winner determines the relevancy. Optionally, keywords could specify weights so that few mentions of an important relevant keyword outweigh many mentions of an irrelevant keyword.
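That comparison-of-keywords idea can be sketched directly, without any ML machinery. The keywords, labels, and weights below are hypothetical examples; a heavier weight on one relevant keyword lets a single mention outvote several irrelevant ones, as the question suggests:

```python
# Hypothetical keyword lists with weights (keyword -> weight).
RELEVANT = {"blue": 3.0, "azure": 1.0}
IRRELEVANT = {"red": 1.0, "crimson": 1.0}

def classify(text):
    """Return 'relevant' or 'irrelevant' by summing signed keyword weights."""
    score = 0.0
    for tok in text.lower().split():
        score += RELEVANT.get(tok, 0.0)      # relevant hits add to the score
        score -= IRRELEVANT.get(tok, 0.0)    # irrelevant hits subtract
    return "relevant" if score > 0 else "irrelevant"

# "blue" (weight 3) outweighs two mentions of "red" (weight 1 each):
# classify("blue red red") -> "relevant"
```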

On the other hand, from what I'm reading, I could run this through a naive Bayes classifier. The question is, can such a classifier accept single words rather than entire texts to analyze? Can it also give more weight to certain words over others?

Best Answer

Take a look at generalized expectation criteria. It's a lightly supervised classification method that starts from labeled keywords and generalizes from there.

A single word can always be treated as a document that contains only one word, so conceptually there's no difference. If you're using a model whose features are the words themselves (naive Bayes or logistic regression), you can also read off the learned feature weights, though it's not always obvious how to interpret them.
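To make the "one-word document" point concrete, here is a minimal hand-rolled multinomial naive Bayes trained on labeled keywords (each keyword is a single-word training document). All keywords and labels are made-up examples; the model uses Laplace smoothing and, for simplicity, a uniform class prior. The per-word log-probability difference between classes plays the role of a feature weight:

```python
import math
from collections import Counter

# Hypothetical labeled keywords; each is a one-word training document.
# Repeating a keyword acts as a crude way to weight it more heavily.
TRAIN = [("blue", "relevant"), ("blue", "relevant"), ("azure", "relevant"),
         ("red", "irrelevant"), ("crimson", "irrelevant")]

def train_nb(examples):
    """Return smoothed log P(word | class) tables plus the vocabulary."""
    counts = {"relevant": Counter(), "irrelevant": Counter()}
    for word, label in examples:
        counts[label][word] += 1
    vocab = {w for w, _ in examples}
    model = {}
    for label, ctr in counts.items():
        total = sum(ctr.values())
        # Laplace (add-one) smoothing so unseen words don't zero things out
        model[label] = {w: math.log((ctr[w] + 1) / (total + len(vocab)))
                        for w in vocab}
        model[label + "_default"] = math.log(1 / (total + len(vocab)))
    return model, vocab

def classify(model, text):
    """Pick the class with the higher summed log-likelihood (uniform prior)."""
    scores = {}
    for label in ("relevant", "irrelevant"):
        default = model[label + "_default"]
        scores[label] = sum(model[label].get(tok, default)
                            for tok in text.lower().split())
    return max(scores, key=scores.get)

def feature_weight(model, word):
    """Log-odds of a word: positive means it pushes toward 'relevant'."""
    return model["relevant"][word] - model["irrelevant"][word]
```

Note the asymmetry this sketch inherits from naive Bayes: a keyword seen twice ("blue") gets a larger weight than one seen once, which is the smoothed-count analogue of the manual weights in the question.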