Solved – Software or libraries to create doc-term matrix

natural language, text mining

Does anyone know of Java libraries for creating a document-term matrix from a large number (50,000) of documents? Ideally the library would include preprocessing functionality such as stop-word removal, punctuation removal, and stemming. I would also like to apply the TF*IDF weighting scheme and normalization. I prefer Java libraries for convenience of development.

Thanks very much for any recommendation.

Best Answer

Weka offers this functionality in Java. Start Weka and open the Explorer, then load your dataset and apply the StringToWordVector filter (listed under the unsupervised attribute filters). This filter can create a document-term matrix (either binary or frequency-based) and supports IDF weighting, stop-word removal, stemming, normalization, punctuation removal, and more.
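If it helps to see what such a filter computes, here is a minimal plain-Java sketch of a TF*IDF document-term matrix with stop-word and punctuation removal and length normalization. It is not Weka's code and does not use Weka's API. The class name `DocTermMatrix`, the tiny stop-word list, the `log(N/df)` form of IDF, and the L2 normalization are all assumptions chosen for illustration, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DocTermMatrix {
    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "of", "and", "is", "on", "in", "to"));

    // Lowercase, split on anything that is not a letter or digit
    // (which drops punctuation), then remove stop words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Sorted list of all distinct terms across the corpus.
    public static List<String> vocabulary(List<String> docs) {
        Set<String> vocab = new TreeSet<>();
        for (String d : docs) {
            vocab.addAll(tokenize(d));
        }
        return new ArrayList<>(vocab);
    }

    // Rows are documents, columns are terms. Each cell is tf * log(N / df),
    // and each row is then scaled to unit Euclidean length.
    public static double[][] tfidf(List<String> docs, List<String> vocab) {
        int n = docs.size();
        int m = vocab.size();
        Map<String, Integer> index = new HashMap<>();
        for (int j = 0; j < m; j++) {
            index.put(vocab.get(j), j);
        }

        double[][] matrix = new double[n][m];
        int[] df = new int[m]; // number of documents containing each term
        for (int i = 0; i < n; i++) {
            for (String t : tokenize(docs.get(i))) {
                Integer j = index.get(t);
                if (j != null) {
                    matrix[i][j] += 1.0; // raw term frequency
                }
            }
            for (int j = 0; j < m; j++) {
                if (matrix[i][j] > 0) {
                    df[j]++;
                }
            }
        }

        for (int i = 0; i < n; i++) {
            double sumSquares = 0.0;
            for (int j = 0; j < m; j++) {
                if (df[j] > 0) {
                    matrix[i][j] *= Math.log((double) n / df[j]);
                }
                sumSquares += matrix[i][j] * matrix[i][j];
            }
            double norm = Math.sqrt(sumSquares);
            if (norm > 0) {
                for (int j = 0; j < m; j++) {
                    matrix[i][j] /= norm;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "The cat sat on the mat.",
                "The dog sat on the log.");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (double[] row : tfidf(docs, vocab)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Note that with this form of IDF, a term that occurs in every document gets weight zero. The dense `double[][]` is only reasonable for small examples; for 50,000 documents a sparse representation would be needed, which is one reason to lean on an existing library rather than rolling your own.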