I have a set of documents like:
D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."
and a set of words like:
"sky","land","sea","water","sun","moon"
I want to create a matrix like this:
x      D1      D2      D3
sky    tf-idf  0       tf-idf
land   0       0       0
sea    0       0       0
water  0       0       0
sun    0       tf-idf  tf-idf
moon   0       0       0
Something like the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. The given link uses the words from the documents themselves, but I need to use the set of words that I have mentioned.
If a particular word is present in a document, I put its tf-idf value in the matrix; otherwise I put a 0.
Any idea how I might build a matrix like this? Python would be best, but R is also appreciated.
I am using the following code but am not sure whether I am doing the right thing or not. My code is:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
train_set = "The sky is blue.", "The sun is bright.", "The sun in the sky is bright." #Documents
test_set = ["sky","land","sea","water","sun","moon"] #Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
#print 'Fit Vectorizer to train set', trainVectorizerArray
#print 'Transform Vectorizer to test set', testVectorizerArray
transformer.fit(trainVectorizerArray)
#print
#print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
#print
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
I am getting very absurd results like this (the values are only 0 and 1, while I am expecting values between 0 and 1).
[[ 0. 0. 1. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 0. 0.]
[ 1. 0. 0. 0.]]
I am also open to other libraries for calculating tf-idf. I just want the correct matrix that I described above.
Best Answer
Have a look at gensim or scikit-learn.
After fitting the model, you can transform your out of sample documents.
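For instance, a minimal sketch with scikit-learn's TfidfVectorizer, assuming the three documents from the question as the training set. The new_docs list is a made-up out-of-sample document for illustration, not something from the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Documents from the question.
train_set = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
]

# Hypothetical out-of-sample document (an assumption for this sketch).
new_docs = ["The sun in the sky is blue."]

# Fit on the training documents: this learns the vocabulary and the idf weights.
vectorizer = TfidfVectorizer(stop_words="english")
train_tfidf = vectorizer.fit_transform(train_set)

# Transform unseen documents with the already-fitted vocabulary and idf.
new_tfidf = vectorizer.transform(new_docs)

print(sorted(vectorizer.vocabulary_))
print(new_tfidf.toarray())
```

Note that only fit_transform (or fit) learns anything; calling transform on new text reuses the fitted vocabulary, so words that never occurred in the training documents are simply ignored.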
However, given your comments, it sounds like what you really want is to evaluate the tf-idf of the original documents using "test_set" as the vocabulary? It's unclear to me what you're after, I guess. If that's the case, though, then something like fitting the vectorizer with test_set as its fixed vocabulary gives you what you want, I think.