Solved – Different vocabulary sizes produced by Weka and R's tm

java, natural language, r, text mining

I have around 40,000 text files to preprocess (for document classification). I used R (with the tm package) for a prototype and am now looking for an equivalent Java library for production.

However, for a very basic task, namely text preprocessing, I ran into a strange problem. With Weka I applied punctuation and stop-word removal, and I applied the same operations in R, so the sizes of the generated vocabularies (terms) should be roughly the same. Yet Weka returns a vocabulary (attributes in the ARFF file) of only 35,000 terms, while R reports more than 1 million distinct terms.

Can anyone help me understand this discrepancy, or point me to more reliable Java libraries for text preprocessing?

Best Answer

Did you apply the StringToWordVector filter in Weka? If so, you did more than just punctuation and stop-word removal. StringToWordVector outputs the document-term matrix of the input text files, so once the preprocessing is done, Weka creates one attribute per unique word. 35k terms sounds plausible for 40k texts.
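
For reference, here is a minimal sketch of how StringToWordVector is typically applied and how to read off the resulting vocabulary size. The `corpus/` directory layout (one subdirectory per class) and the `wordsToKeep` value are assumptions for illustration, not details from your setup, and stop-word handling options vary between Weka versions:

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class VocabSizeCheck {
    public static void main(String[] args) throws Exception {
        // Load raw text files; TextDirectoryLoader expects one subdirectory per class.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("corpus/"));
        Instances raw = loader.getDataSet();

        // Turn the string attribute into a document-term matrix.
        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);
        filter.setWordsToKeep(1000000); // cap on kept words; raise it to keep (roughly) everything
        filter.setInputFormat(raw);
        Instances vectorized = Filter.useFilter(raw, filter);

        // Number of attributes minus the class attribute ~ vocabulary size.
        System.out.println("Vocabulary size: " + (vectorized.numAttributes() - 1));
    }
}
```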

The preprocessing in R seems to have been only punctuation and stop-word removal, so the 1 million figure for 40k documents is most likely the total number of words, not the number of unique words. Are your text files roughly 25 words long on average? If not, then something else is indeed going on.
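
A quick way to check this is to count total versus distinct tokens over the corpus directly. A rough Java sketch follows; the `corpus/` path is an assumption, and the simple split on non-letter characters will not match tm's tokenization exactly, so treat the numbers as ballpark figures:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TokenCount {
    public static void main(String[] args) throws IOException {
        long total = 0;
        Set<String> unique = new HashSet<>();

        // Collect all regular files under the corpus directory.
        List<Path> paths;
        try (Stream<Path> files = Files.walk(Path.of("corpus/"))) {
            paths = files.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // Tokenize each file on runs of non-letter characters (crude punctuation removal).
        for (Path p : paths) {
            String text = Files.readString(p).toLowerCase();
            for (String token : text.split("[^\\p{L}]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                total++;
                unique.add(token);
            }
        }

        System.out.println("Total tokens:  " + total);
        System.out.println("Unique tokens: " + unique.size());
    }
}
```

If the total count is near 1 million while the unique count is near 35k, the difference between the two tools is just total words versus distinct words, not a preprocessing bug.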