I want to find the similarity between a new document and a set of documents stored as TF-IDF scores in a pickle file (Python). The TF-IDF computation is done offline, so that part is not a problem, but when I submit a new document for a similarity check it takes around 2 minutes, while I need something close to real time (< 2 seconds). For this purpose I used the following code:
    for p_tf in p_tfidf:
        temp_similarity = 0
        for item in p_tf:
            (score, word) = item
            if word in input_text:
                temp_similarity += score
        similarity_score.append([temp_similarity, id])
Any ideas on how to speed this up?
Best Answer
You can make use of sklearn.feature_extraction.text.TfidfVectorizer.
A simple example:
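As a minimal sketch of this approach (not necessarily the exact code the answer had in mind), the snippet below fits a TfidfVectorizer once, keeps the resulting sparse matrix, and scores a new document against all stored documents in a single vectorized call. The corpus, the query string, and the helper name rank_documents are hypothetical and used only for illustration; cosine similarity is an assumed choice of metric, since the question sums raw scores instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus standing in for the documents processed offline.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply on monday",
]

# Offline step: fit once, then keep (for example, pickle) both the fitted
# vectorizer and the sparse document-term matrix.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)


def rank_documents(query, top_n=3):
    """Return (document index, cosine similarity) pairs, best match first."""
    # Online step: transform the new document with the already fitted vocabulary.
    query_vec = vectorizer.transform([query])
    # One sparse matrix product replaces the nested Python loops.
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    order = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]


ranking = rank_documents("a cat on a mat")
print(ranking)
```

Because the new document is only transformed, never refitted, the per-query cost is one sparse matrix product, which is typically far cheaper than looping over every (score, word) pair in Python.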
Result:
As a side note, you can remove "stop words" such as "on" by passing the stop_words='english' parameter.
Edit:
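To illustrate the stop_words parameter, here is a small sketch that compares the vocabulary learned with and without scikit-learn's built-in English stop word list. The two-sentence corpus is made up for the example and is not the asker's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus used only to show the effect of stop word removal.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Vocabulary learned with the default settings (no stop word filtering).
default_vocab = set(TfidfVectorizer().fit(corpus).vocabulary_)

# Vocabulary learned when the built-in English stop word list is applied.
filtered_vocab = set(
    TfidfVectorizer(stop_words="english").fit(corpus).vocabulary_
)

# Terms that were dropped because they are on the stop word list.
removed = sorted(default_vocab - filtered_vocab)
print(removed)
```

Dropping very common words shrinks the vocabulary and keeps words like "on" from contributing to the similarity score.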
Result: