Solved – Continually updating naive Bayes classifier

text mining

I am attempting to use a Naive Bayes classifier to classify text. To accomplish this I have created an Excel sheet with a binary distribution for three variables. The workbook can be found here. Assuming that my math is correct, my questions are:

  1. Can my training set be expanded as I classify new inputs? In other words, every time I verify a classification that the model has produced, I could add that example to the training set, but then I might have an uneven number of examples for each class. Is this a problem?
  2. How can I incorporate a prior distribution into the equation? For example, what if I know from prior data that Class A is twice as likely as Class B?
  3. How can I incorporate tf–idf into the equation? I can analyze all the data sets a priori and compute the frequencies of each word in both the corpus and each document, but I am unsure how to incorporate this into the classifier.

Thanks in advance for everyone's help.

AMAS

Best Answer

If you're doing this yourself, as opposed to using a package, it's fairly straightforward to do all three of these things. If you're using an off-the-shelf implementation, whether this is possible depends on what you're using. In these explanations I'm going to assume the attributes take categorical values (as most simple versions of NB do). I'll describe the approach for a single continuous-valued feature (say $f_w$, the frequency of some word in your text, normalised by the document length), discretised into three histogram bins:

very rare: $0 \le f_w < 0.001$

rare: $0.001 \le f_w < 0.01$

frequent: $0.01 \le f_w \le 1$

Thus, for our word $w$, its feature value always falls in exactly one of these three intervals. Now to answer your questions:

1) The parameters can be updated as you see new examples by maintaining counts over the three bins for all the documents you've seen. The probability of a bin in subsequent documents is its count divided by the sum of the counts. Each time you see a document, increment the counts, as in the sketch below.
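
For concreteness, here's a minimal Python sketch of that bookkeeping. The bin edges come from the example above; the class label, function names, and the add-alpha smoothing are my own additions (the answer's plain count-divided-by-total rule corresponds to `alpha=0`):

```python
from collections import defaultdict

def bin_of(f_w):
    """Map a normalised word frequency to one of the three histogram bins."""
    if 0.0 <= f_w < 0.001:
        return "very rare"
    if 0.001 <= f_w < 0.01:
        return "rare"
    if 0.01 <= f_w <= 1.0:
        return "frequent"
    raise ValueError(f"frequency {f_w} is outside [0, 1]")

# counts[class_label][bin_name] -> number of documents of that class seen so far
counts = defaultdict(lambda: defaultdict(int))

def update(class_label, f_w):
    """Increment the bin count each time a new labelled document arrives."""
    counts[class_label][bin_of(f_w)] += 1

def bin_probability(class_label, bin_name, alpha=1.0):
    """Estimate P(bin | class); add-alpha smoothing keeps unseen bins non-zero."""
    total = sum(counts[class_label].values())
    return (counts[class_label][bin_name] + alpha) / (total + 3 * alpha)

update("A", 0.004)               # a new document of class A where w has frequency 0.004
print(bin_probability("A", "rare"))
```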

2) Technically the NB model is the likelihood: you train a model as above for each class, then multiply the likelihood by the prior to get the posterior probability of the class. Be aware that in NB your likelihoods often swamp your priors, because the independence assumption leads to very sharp distributions (see this paper by Hand and Yu).
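
As a small illustration of that prior-times-likelihood step (the class labels and likelihood numbers are made up; the 2:1 prior is taken from the question), working in log space:

```python
import math

# Hypothetical priors and per-class log-likelihoods for one document; in practice
# the likelihood is the product of P(bin | class) over all word features,
# computed from counts as in the sketch above.
priors = {"A": 2.0 / 3.0, "B": 1.0 / 3.0}    # e.g. Class A twice as likely as Class B
log_likelihoods = {"A": -12.4, "B": -10.1}   # illustrative numbers only

# Posterior (up to normalisation): prior * likelihood, computed in log space
log_posteriors = {c: math.log(priors[c]) + log_likelihoods[c] for c in priors}

# Normalise with the log-sum-exp trick to get posterior class probabilities
m = max(log_posteriors.values())
z = sum(math.exp(v - m) for v in log_posteriors.values())
posteriors = {c: math.exp(v - m) / z for c, v in log_posteriors.items()}
print(posteriors)   # the predicted class is the one with the largest posterior
```

Working in logs is optional but avoids numerical underflow when you multiply many small per-feature probabilities together.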

3) Easy: just change the feature $f_w$ from the normalised frequency of $w$ to the tf–idf of $w$. Be aware that you'll need to specify new, sensible bins in your histogram if you stick with the categorical approach (the alternative is to specify continuous distributions on your features, but it's tricky to come up with good ones).
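
One possible sketch of computing such a tf–idf feature by hand, assuming a toy corpus and one common smoothed idf variant (real implementations and libraries differ in the exact weighting):

```python
import math

# Toy corpus and a single query word; everything here is a placeholder for
# your own documents and vocabulary.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tf_idf(word, doc, docs):
    """tf-idf of `word` in `doc`, using a smoothed idf so the value stays positive."""
    tokens = doc.split()
    tf = tokens.count(word) / len(tokens)              # normalised term frequency
    df = sum(1 for d in docs if word in d.split())     # document frequency
    idf = math.log((1 + len(docs)) / (1 + df)) + 1     # smoothed inverse document frequency
    return tf * idf

# This value replaces the raw normalised frequency f_w as the feature for w;
# you would then pick new histogram bins for it before counting as before.
print(tf_idf("cat", corpus[1], corpus))
```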
