Solved – Search in TF-IDF

machine-learning, python, text-mining

I want to find the similarity between a new document and a set of documents stored as TF-IDF vectors in a pickle file (Python). The TF-IDF computation is done offline, so that part is not a problem, but when I send a new document for a similarity check it takes around two minutes, while I need something close to real-time (< 2 seconds). For this purpose I used the following code:

similarity_score = []

for doc_id, p_tf in enumerate(p_tfidf):
    temp_similarity = 0
    for score, word in p_tf:
        if word in input_text:
            temp_similarity += score

    similarity_score.append([temp_similarity, doc_id])

Any clue how to speed this up?

Best Answer

You can make use of sklearn.feature_extraction.text.TfidfVectorizer

A simple example:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1)

my_phrases = ["boring answer phrase",
              "exciting phrase",
              "phrase on stackoverflow",
              "answer on stackoverflow"]

my_features = vectorizer.fit_transform(my_phrases)

Result:

>>> import numpy as np
>>> np.set_printoptions(precision=4)
>>> my_features.A
array([[ 0.5535,  0.702 ,  0.    ,  0.    ,  0.4481,  0.    ],
       [ 0.    ,  0.    ,  0.8429,  0.    ,  0.538 ,  0.    ],
       [ 0.    ,  0.    ,  0.    ,  0.6137,  0.4968,  0.6137],
       [ 0.5774,  0.    ,  0.    ,  0.5774,  0.    ,  0.5774]])
>>> vectorizer.get_feature_names()
[u'answer', u'boring', u'exciting', u'on', u'phrase', u'stackoverflow']

As a side note, you can remove "stop words" like "on" by passing the stop_words='english' parameter:

vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
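To see the effect, here is a small sketch (using two of the phrases from above) showing that "on" disappears from the learned vocabulary once English stop words are filtered:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

phrases = ["phrase on stackoverflow",
           "answer on stackoverflow"]

# with stop_words='english', common words like "on" are dropped
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
features = vectorizer.fit_transform(phrases)

print(sorted(vectorizer.vocabulary_))
# → ['answer', 'phrase', 'stackoverflow']
```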

Edit:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# each phrase here could be a document in your list
# of documents
my_phrases = ["boring answer phrase",
              "exciting phrase",
              "phrase on stackoverflow",
              "answer on stackoverflow"]

#  and you want to find the most similar document
#  to this document             
phrase = ["stackoverflow answer"]

# You could do it like this:
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
all_phrases = phrase + my_phrases
my_features = vectorizer.fit_transform(all_phrases)
scores = (my_features[0, :] * my_features[1:, :].T).A[0]
best_index = np.argmax(scores)
answer = my_phrases[best_index]

Result:

>>> answer
'answer on stackoverflow'
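To meet the real-time requirement from the question, avoid re-fitting on every query: fit the vectorizer on the document collection once offline, pickle both the vectorizer and the document matrix, and at query time only call transform on the new document. One sparse matrix product then scores all documents at once. A minimal sketch, assuming the pickle path tfidf.pkl and the toy documents from above:

```python
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["boring answer phrase",
             "exciting phrase",
             "phrase on stackoverflow",
             "answer on stackoverflow"]

# --- offline: fit once on the whole collection and persist ---
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
doc_matrix = vectorizer.fit_transform(documents)
with open('tfidf.pkl', 'wb') as f:
    pickle.dump((vectorizer, doc_matrix), f)

# --- online: load once, then each query is just transform + one product ---
with open('tfidf.pkl', 'rb') as f:
    vectorizer, doc_matrix = pickle.load(f)

query_vec = vectorizer.transform(["stackoverflow answer"])
# rows are L2-normalized, so the dot products are cosine similarities
scores = (query_vec @ doc_matrix.T).toarray()[0]
best = documents[np.argmax(scores)]
print(best)  # → 'answer on stackoverflow'
```

The expensive part (fitting the vocabulary and IDF weights) happens offline; per query you only tokenize one document and multiply a 1×V sparse vector against the precomputed matrix, which is what keeps this well under two seconds.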