Does anyone know of Java libraries for creating a document-term matrix from a large number (50,000) of documents? I would like the library to include preprocessing functionality such as stop-word and punctuation removal, stemming, etc. I also want to use the TF*IDF weighting scheme and normalization. I prefer Java libraries for convenience of development.
Thanks very much for any recommendation.
Best Answer
Weka offers this functionality in Java. Start Weka and open the Explorer. Then load your dataset and apply the StringToWordVector filter. This filter can create a document-term matrix (either binary or by frequency) and can apply IDF weighting, stop-word removal, stemming, normalization, punctuation removal and more.
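
To make concrete what such a filter produces, here is a minimal plain-Java sketch of a TF*IDF document-term matrix with stop-word removal, punctuation removal and length normalization. It is not Weka's implementation and does not call Weka: the class name DocTermMatrix, its methods, the idf formula log(N/df), the L2 row normalization and the tiny stop-word list are all assumptions chosen for illustration. Stemming is left out, and the dense double[][] is only practical for small inputs; for 50,000 documents a sparse representation would be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Weka.
final class DocTermMatrix {

    // Lowercase, split on anything that is not a-z (drops punctuation and digits),
    // then discard stop words.
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Map each distinct term to a column index, in alphabetical order.
    static Map<String, Integer> vocabulary(List<List<String>> docs) {
        Set<String> sorted = new TreeSet<>();
        for (List<String> doc : docs) {
            sorted.addAll(doc);
        }
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String term : sorted) {
            vocab.put(term, vocab.size());
        }
        return vocab;
    }

    // Rows are documents, columns are terms. Each cell is tf * log(N / df),
    // and each non-zero row is scaled to unit Euclidean length.
    static double[][] tfidf(List<List<String>> docs, Map<String, Integer> vocab) {
        int n = docs.size();
        int v = vocab.size();
        double[][] m = new double[n][v];
        int[] df = new int[v];
        for (int d = 0; d < n; d++) {
            for (String t : docs.get(d)) {
                Integer col = vocab.get(t);
                if (col != null) {
                    m[d][col] += 1.0;            // raw term frequency
                }
            }
            for (int j = 0; j < v; j++) {
                if (m[d][j] > 0) {
                    df[j]++;                     // document frequency
                }
            }
        }
        for (int d = 0; d < n; d++) {
            double sumSq = 0.0;
            for (int j = 0; j < v; j++) {
                if (df[j] > 0) {
                    m[d][j] *= Math.log((double) n / df[j]);
                }
                sumSq += m[d][j] * m[d][j];
            }
            double norm = Math.sqrt(sumSq);
            for (int j = 0; norm > 0 && j < v; j++) {
                m[d][j] /= norm;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        List<List<String>> docs = new ArrayList<>();
        for (String s : Arrays.asList("The cat sat.", "The dog sat!", "Cats and dogs.")) {
            docs.add(tokenize(s, stop));
        }
        Map<String, Integer> vocab = vocabulary(docs);
        double[][] m = tfidf(docs, vocab);
        System.out.println(vocab.keySet());
        for (double[] row : m) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Note that with this idf formula a term occurring in every document gets weight zero, and without stemming "cat" and "cats" end up in separate columns, which is the kind of case the stemming option mentioned above is meant to handle.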