Solved – Document classification with naive Bayes algorithm

bayesian, classification, r, text mining

I want to build a document classifier in R, using the Naive Bayes approach.

Here are the steps I've taken so far:

  • I have a corpus of about 30 documents from 2 authors (the classes are "target author" and "other author").
  • The "vocabulary" (training set) has been pre-processed (numbers removed, punctuation removed, words converted to lower case, stop words removed, documents stemmed, whitespace stripped), and I am considering only the most frequent words (top 700).
  • Now I have a matrix which looks like:

    [image of the matrix]
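
The pre-processing steps listed above could be sketched roughly as follows in base R. This is only an illustration: the two toy documents, the short stop-word list, and the `tokenize` helper are all assumptions made up for the example, stemming is left out, and `top_n` is set to 3 instead of 700 so that the result is easy to inspect.

```r
# Toy documents standing in for the real corpus (hypothetical data).
docs <- c(doc1 = "The wild wind blew in 1999, and the woman smiled.",
          doc2 = "A woman walked; the wind was wild, wild, wild!")

# Small illustrative stop-word list (an assumption; real lists are much longer).
stop_words <- c("the", "a", "and", "in", "was")

# Hypothetical helper: clean one document and split it into words.
tokenize <- function(text) {
  text <- tolower(text)                                  # lower case
  text <- gsub("[[:digit:]]+", " ", text)                # remove numbers
  text <- gsub("[[:punct:]]+", " ", text)                # remove punctuation
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]   # strip and split on whitespace
  words[!words %in% stop_words]                          # drop stop words
}

tokens <- lapply(docs, tokenize)

# Keep the top_n most frequent words across the training documents.
top_n <- 3
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- names(freq)[seq_len(min(top_n, length(freq)))]

# Document-term matrix: one row per document, one column per vocabulary word.
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
print(dtm)
```

The `vocab` vector is the part worth keeping around: it defines the columns that any later test matrix has to match.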

Then I trained a naive Bayes classifier using an existing R library, e1071.

Here are my questions:

I want to test my classifier on other documents that were not part of the training set.

  • How should I prepare my data matrix? What if those other documents don't contain all the words (attributes) from my training set? Should I add dummy columns for the missing words (e.g., with value=0)?
  • Does the position of the words (columns order) matter?

Here is an example:

Training attributes:

"wild"  "wind"  "woman"

Testing attributes:

"woman" "wind" "wild"  

Is this OK, or should the columns be in the same order as in the training matrix?

Best Answer

You should construct your features (in this case, the words you're including as descriptors of each document) based only on your training set. From the training set you then estimate the probability of observing a given word in a document of a particular class: $P(w_i|c_k)$. These probabilities are what you need to calculate the probability of a document belonging to some class: $P(c_{k}|\text{document})$
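
For reference, under the usual naive Bayes independence assumption these two quantities are related as follows, where $w_1, \dots, w_n$ are the vocabulary words observed in the document and $P(c_k)$ is the class prior estimated from the training set. This is the standard textbook form, and not necessarily the exact model that a given library fits:

$$P(c_k \mid \text{document}) \propto P(c_k) \prod_{i=1}^{n} P(w_i \mid c_k)$$

Every factor on the right-hand side comes from the training set, which is why a word that never appeared in training has no factor to contribute.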

When you want to predict the class for a new document in the test set, ignore the words that are not included in the training set. The reason is that you can't use the test set for anything other than testing your predictions. Furthermore, the training set must be representative of the test set. Otherwise, you won't get a good classifier. Therefore, it is to be expected that the majority of the words in the test set are also included in the training set.
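
In practice, this amounts to reshaping the test matrix so that it has exactly the training columns. Below is a minimal base R sketch, assuming the test documents are already in a count matrix with words as column names. The helper `align_to_vocab` and the toy data are hypothetical and are not part of e1071. Training words missing from the test documents get a count of 0, words seen only in the test documents are dropped, and the columns end up in the training order, which is the safe choice regardless of how a particular `predict` method matches columns.

```r
# Hypothetical helper: align a test document-term matrix to the training vocabulary.
align_to_vocab <- function(test_dtm, vocab) {
  # Start from an all-zero matrix: one row per test document,
  # one column per training word, in the training order.
  aligned <- matrix(0, nrow = nrow(test_dtm), ncol = length(vocab),
                    dimnames = list(rownames(test_dtm), vocab))
  # Copy the counts of words present in both; training words absent from the
  # test set stay at 0, and test-only words are ignored.
  shared <- intersect(colnames(test_dtm), vocab)
  aligned[, shared] <- test_dtm[, shared, drop = FALSE]
  aligned
}

train_vocab <- c("wild", "wind", "woman")

# Toy test matrix (filled column by column): different column order,
# "wild" is missing, and "zebra" never occurred in training.
test_dtm <- matrix(c(2, 0,    # woman
                     1, 3,    # wind
                     4, 1),   # zebra
                   nrow = 2,
                   dimnames = list(c("doc1", "doc2"),
                                   c("woman", "wind", "zebra")))

aligned <- align_to_vocab(test_dtm, train_vocab)
print(aligned)
```

The resulting matrix can then be converted to whatever format the classifier was trained on (for example a data frame) before calling its prediction function.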

Some people add an extra column for unknown words and try to calculate the probability of such words given a certain class: $P(\text{unknown} | c_{k})$. I don't think this is necessary or even appropriate, because in order to obtain this probability you need to peek at the test set. That's something you must never do.
