Solved – Identifying most important words in text classification

machine learning, text mining

I know one can use tf-idf to distinguish important words based on the number of times a word appears in a document relative to the number of times it appears in the entire collection of documents. However, how does one identify which words are most important in distinguishing the positive class from the negative class?
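For concreteness, this is roughly the tf-idf ranking I have in mind (a minimal sketch with scikit-learn; the tiny corpus is made up purely for illustration):

```python
# Sketch of ranking words in one document by tf-idf weight.
# The documents below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was fantastic and the acting was great",
    "the movie was terrible and the plot was boring",
    "a fantastic film with great performances",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)          # shape: (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

# Highest-weighted terms in the first document.
row = tfidf[0].toarray().ravel()
top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:5]
print(top)
```

This ranks words within a document, but it does not tell me which words separate the positive class from the negative class.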

Best Answer

Build a simple Naive Bayes classifier over a corpus of positive and negative sentences, and weight each word by the probability that it signals the positive class. For example, if the word "fantastic" appears 80 times in positive sentences and 20 times in negative ones, then p(positive | "fantastic") = 80 / (80 + 20) = 0.8. Once every word has a weight, represent a new sentence as a bag of words, ignore words that carry no positive or negative signal, and compute the probability of the sentence under each class as follows:

$$ p(\text{negative} \mid s) \propto p(\text{negative}) \prod_{i=1}^{n} p(w_i \mid \text{negative}) $$ $$ p(\text{positive} \mid s) \propto p(\text{positive}) \prod_{i=1}^{n} p(w_i \mid \text{positive}) $$

Whichever score is greater gives the predicted class for the sentence $$ s = w_1, w_2, w_3, \ldots, w_n $$
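A minimal sketch of this word-counting scheme, assuming a tiny labelled corpus of my own; the Laplace smoothing inside the likelihood is also an addition so the example runs end to end:

```python
# Word-counting Naive Bayes over a toy positive/negative corpus.
# Corpus and smoothing are assumptions made for this illustration.
from collections import Counter
import math

positive_docs = ["fantastic movie great acting", "great fantastic plot"]
negative_docs = ["boring movie terrible acting", "terrible plot"]

pos_counts = Counter(w for d in positive_docs for w in d.split())
neg_counts = Counter(w for d in negative_docs for w in d.split())
vocab = set(pos_counts) | set(neg_counts)

# Per-word weight, e.g. p(positive | "fantastic") = 80 / (80 + 20) = 0.8 in the answer's example.
def p_positive(word):
    pos, neg = pos_counts[word], neg_counts[word]
    return pos / (pos + neg) if pos + neg else 0.5

# Class priors and smoothed likelihoods p(w | class) for scoring a whole sentence.
p_pos = len(positive_docs) / (len(positive_docs) + len(negative_docs))
p_neg = 1 - p_pos

def log_likelihood(words, counts):
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in words if w in vocab)   # ignore words unseen in training

def classify(sentence):
    words = sentence.split()
    score_pos = math.log(p_pos) + log_likelihood(words, pos_counts)
    score_neg = math.log(p_neg) + log_likelihood(words, neg_counts)
    return "positive" if score_pos > score_neg else "negative"

print(p_positive("fantastic"))       # word-level weight
print(classify("a fantastic plot"))  # sentence-level decision
```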

There are many improvements you can make here. A bag of words ignores word order and the dependencies between words, so you can use a representation that keeps them (n-grams, for instance). You can also replace the generative model (Naive Bayes) with a discriminative one, or use another classifier such as an SVM; a sketch of that route follows below.
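As a rough illustration of the discriminative route, the following sketch trains a linear SVM with scikit-learn and reads the most class-distinguishing words off its coefficients; the toy corpus and the choice of LinearSVC are assumptions made for the example, not the only option.

```python
# Train a linear discriminative model and inspect per-word coefficients
# to find the words that most separate the two classes.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs   = ["fantastic movie great acting", "great fantastic plot",
          "boring movie terrible acting", "terrible plot boring"]
labels = [1, 1, 0, 0]   # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)

terms = vectorizer.get_feature_names_out()
coefs = clf.coef_.ravel()           # one weight per word
order = np.argsort(coefs)
print("most negative words:", terms[order[:3]])
print("most positive words:", terms[order[-3:]])
```

Words with the largest positive coefficients push a sentence toward the positive class and words with the largest negative coefficients toward the negative class, which directly answers the original question about which words distinguish the two classes.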