Solved – Software or libraries to create doc-term matrix

natural language, text mining

Does anyone know of any Java libraries that can create a document-term matrix for a large number (50,000) of documents? I would like the library to include preprocessing functionality such as stop-word removal, punctuation removal, stemming, etc. I would also like to use the TF*IDF weighting scheme and normalization. I prefer Java libraries for convenience of development.

Thanks very much for any recommendations.

Best Answer

Weka offers this functionality in Java. Start Weka and open the Explorer. Then load your dataset and apply the StringToWordVector filter. This filter can create a document-term matrix (either binary or by term frequency) and can apply IDF weighting, stop-word removal, stemming, normalization, punctuation removal and more.
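To make the steps listed above concrete, here is a minimal standard-library Python sketch of a document-term matrix with stop-word removal, punctuation removal, TF*IDF weighting and L2 normalization. It is not Weka's implementation: the tiny stop-word list, the function names and the plain log(N/df) IDF variant are all assumptions chosen for illustration, and stemming is left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "in", "a", "of"}

def tokenize(text):
    """Lowercase, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def tfidf_matrix(docs):
    """Return (vocabulary, rows) with one L2-normalized tf-idf row per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Plain IDF variant log(N / df); libraries often use smoothed variants.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row] if norm else row)
    return vocab, rows

docs = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
vocab, rows = tfidf_matrix(docs)
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

Each row corresponds to one document and each column to one term in `vocab`. For 50,000 documents a sparse representation would be needed in practice, which is one reason to rely on a library rather than a sketch like this.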

Related Solutions

Solved – Different size of vocabulary made by Weka and R’s tm

Did you apply the StringToWordVector filter in Weka? If so, then you did more than just punctuation and stop-word removal. StringToWordVector outputs only the document-term matrix of the input text files, so once the preprocessing mentioned above is done, Weka creates one term for each unique word. 35k terms sounds plausible for 40k texts.

The preprocessing in R seems to have consisted only of punctuation and stop-word removal. So the 40k documents yield 1M words in total, not 1M unique words. Do your text files contain approximately 25 words on average (1M words / 40k documents)? If not, then something else is indeed going on.
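The difference between the total number of words and the number of unique terms can be checked with a few lines of Python. This is only a sketch on three made-up sentences, not the asker's data; the variable names are illustrative.

```python
import re

docs = [
    "The sun is bright.",
    "The sky is blue.",
    "The sun in the sky is bright.",
]

# Every word occurrence counts as a token; the vocabulary keeps each word once.
tokens = [w for d in docs for w in re.findall(r"[a-z]+", d.lower())]
total_tokens = len(tokens)
unique_terms = len(set(tokens))
avg_len = total_tokens / len(docs)

print(total_tokens, unique_terms, avg_len)  # 15 7 5.0

# The same arithmetic for the figures quoted above.
print(1_000_000 / 40_000)  # 25.0
```

A document-term matrix has one column per unique term, so its width follows `unique_terms`, not `total_tokens`.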

Solved – Create a matrix of tf-idf values from documents

Have a look at gensim or scikit-learn.

Code

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')


train_set = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
stop_words = stopwords.words('english')

# Learn the vocabulary and IDF weights, then return the tf-idf document-term matrix
transformer = TfidfVectorizer(stop_words=stop_words)
transformer.fit_transform(train_set).todense()

After fitting the model, you can transform your out-of-sample documents. Here test_set is assumed to be a list of new document strings, in the same form as train_set.

transformer.transform(test_set).todense()
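The fit/transform split can be sketched without any library to show what "out of sample" means here: the vocabulary and IDF weights come from the training documents only, and terms that appear only in new documents are ignored. The helper names and the smoothed IDF formula below are assumptions for illustration, not scikit-learn's code, and stop-word removal and normalization are omitted to keep it short.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def fit(train_docs):
    """Learn the vocabulary and IDF weights from the training documents only."""
    n = len(train_docs)
    df = Counter(w for d in train_docs for w in set(tokens(d)))
    # One common smoothed variant: ln((1 + n) / (1 + df)) + 1.
    return {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}

def transform(docs, idf):
    """Weight documents with the fitted IDF; terms unseen in training are dropped."""
    vocab = sorted(idf)
    rows = []
    for d in docs:
        tf = Counter(tokens(d))
        rows.append([tf[w] * idf[w] for w in vocab])
    return vocab, rows

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The moon is bright."]  # hypothetical out-of-sample document

idf = fit(train_set)
vocab, rows = transform(test_set, idf)
print(vocab)
print(rows)
```

The word "moon" never appears in the output because it was not in the training vocabulary, which mirrors how a fitted vectorizer treats unseen terms.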

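However, given your comments, it sounds like what you really want is to evaluate the tf-idf of the original documents using "test_set" as the vocabulary. It's not entirely clear to me what you're after. If that is the case, though, then something like the following should work, assuming test_set is a list of lowercase terms, since the vocabulary argument expects terms and not whole documents:

transformer = TfidfVectorizer(stop_words=stop_words, vocabulary=test_set)
transformer.fit_transform(train_set).todense().T

I think this gives you what you want: after the transpose, each row corresponds to a term in the vocabulary and each column to a document.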
