Create a matrix of tf-idf values from documents

information retrieval, machine learning, python, r, text mining

I have a set of documents like:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

and a set of words like:

"sky","land","sea","water","sun","moon"

I want to create a matrix like this:

   x        D1           D2         D3
sky         tf-idf       0          tf-idf
land        0            0          0
sea         0            0          0
water       0            0          0
sun         0            tf-idf     tf-idf
moon        0            0          0

Something like the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. That example builds its vocabulary from the words in the documents themselves, but I need to use the fixed set of words listed above.

If the particular word is present in the document then I put the tf-idf values, else I put a 0 in the matrix.

Any idea how I might build a matrix like this? Python would be best, but R is also appreciated.

I am using the following code but am not sure whether I am doing the right thing or not. My code is:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords


train_set = "The sky is blue.", "The sun is bright.", "The sun in the sky is bright." #Documents
test_set = ["sky","land","sea","water","sun","moon"] #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
#print 'Fit Vectorizer to train set', trainVectorizerArray
#print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
#print
#print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
#print 
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

The results look wrong to me: every value is exactly 0 or 1, whereas I expected values between 0 and 1.

[[ 0.  0.  1.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]]   

I am also open to other libraries for calculating tf-idf. I just want a correct matrix of the form described above.

Best Answer

Have a look at gensim or scikit-learn.

Code

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords


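train_set = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]  # documents
test_set = ["sky", "land", "sea", "water", "sun", "moon"]  # query words, as in the question
stop_words = stopwords.words('english')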

transformer = TfidfVectorizer(stop_words=stop_words)
transformer.fit_transform(train_set).todense()

After fitting the model, you can transform your out-of-sample documents:

transformer.transform(test_set).todense()
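
This also suggests why the matrix in the question contains only 0 and 1. Each entry of test_set is a single word, and with the default L2 normalization a row with one nonzero weight is scaled to 1, while a word outside the learned vocabulary yields an all-zero row. A minimal sketch of that effect, using a hypothetical `l2_normalize` helper and made-up weights rather than scikit-learn itself:

```python
import math


def l2_normalize(row):
    """Scale a row to unit Euclidean length; leave all-zero rows untouched."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm if norm else 0.0 for v in row]


# Made-up tf * idf weights over a 4-term vocabulary (illustrative values only).
single_word_query = [0.0, 0.0, 1.2877, 0.0]    # one known word, e.g. "sky"
unknown_word_query = [0.0, 0.0, 0.0, 0.0]      # a word not in the vocabulary
two_word_document = [0.0, 1.2877, 0.0, 1.2877]  # two known words, equal weight

print(l2_normalize(single_word_query))   # the single nonzero entry becomes 1
print(l2_normalize(unknown_word_query))  # stays all zeros
print(l2_normalize(two_word_document))   # each nonzero entry is about 0.707
```

So values strictly between 0 and 1 only appear for rows that contain more than one vocabulary term.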

However, given your comments, it sounds like what you really want is the tf-idf of the original documents computed with "test_set" as the vocabulary. It's not entirely clear to me what you're after, but if that's the case, then something like

transformer = TfidfVectorizer(stop_words=stop_words, vocabulary=test_set)
transformer.fit_transform(train_set).todense().T

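gives you what you want, I think. The transpose puts the words in rows and the documents in columns, matching the table in the question.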
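
To sanity-check the numbers, here is a dependency-free sketch of the same calculation over a fixed vocabulary. It assumes the smoothed idf that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document, and it uses a simplified tokenizer. `tokenize` and `tfidf_matrix` are hypothetical helper names, not part of any library.

```python
import math
import re

docs = ["The sky is blue.", "The sun is bright.", "The sun in the sky is bright."]
vocab = ["sky", "land", "sea", "water", "sun", "moon"]


def tokenize(text):
    # Lowercase and keep runs of two or more word characters (an assumption
    # meant to approximate scikit-learn's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_matrix(docs, vocab):
    """Return a words x documents matrix of L2-normalized tf-idf values."""
    n = len(docs)
    tokens = [tokenize(d) for d in docs]
    # Document frequency: number of documents containing each vocabulary word.
    df = {w: sum(1 for t in tokens if w in t) for w in vocab}
    # Smoothed idf, assumed to match scikit-learn's default (smooth_idf=True).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

    columns = []
    for t in tokens:
        raw = [t.count(w) * idf[w] for w in vocab]  # tf * idf per word
        norm = math.sqrt(sum(v * v for v in raw))   # L2 norm of this document
        columns.append([v / norm if norm else 0.0 for v in raw])

    # Transpose so that rows are words and columns are documents.
    return [[col[i] for col in columns] for i in range(len(vocab))]


print(f"{'':<6}" + "".join(f"{'D' + str(i + 1):>8}" for i in range(len(docs))))
for word, row in zip(vocab, tfidf_matrix(docs, vocab)):
    print(f"{word:<6}" + "".join(f"{v:8.3f}" for v in row))
```

Because the normalization runs over the restricted vocabulary, D1 and D2 each contain a single vocabulary word and end up with 1 in that cell, while D3 gets about 0.707 for both sky and sun. If raw tf × idf weights are wanted instead, the normalization step can be skipped (in scikit-learn that would be `norm=None`).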