Solved – Subset documents based on tfidf weights

python, scikit-learn

I am new to text mining, so please bear with me if this question sounds too easy to others. I tried to find a solution but had no success.

I am working on a document classification project. I want to use tf-idf scores to filter documents, i.e. as a form of feature selection. My goal is to remove the documents whose collective tf-idf score (the sum of the tf-idf scores of each word in the document) is below some threshold.

I am using the Python sklearn library for this.

Here is my code

# The documents are stored in a dictionary: dic["id"] holds the document id
# (index or filename) and dic["token"] holds the unstructured text of each
# document, as read from the csv file.
# Number of rows = number of documents = 95318.
from sklearn.feature_extraction.text import TfidfVectorizer

dic = {"token": [], "id": []}
for row in csvFile:   # csvFile: reader over the input csv file
    dic["token"].append(row[0])
    dic["id"].append(row[1])

tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, max_df=94000,
                                   decode_error='ignore', norm="l2")
tdm = tfidf_vectorizer.fit_transform(dic["token"])
print(tdm.shape)  # (95318, 20266)
feature_names = tfidf_vectorizer.get_feature_names()

Now I would like to get an output like this:

docName | feature_name 1 | feature_name 2 | tfidf
doc#1   | species        | genes          | 20.22
doc#2   | average        | enlargement    | 19.12

Here, the tfidf in the last column is the sum of the tf-idf scores of feature_name 1 and feature_name 2 (e.g. for doc#1, the tf-idf of "species" plus the tf-idf of "genes" gives 20.22).

This way I would be able to get the subset of the 95,318 documents with the highest total tf-idf weight.

I could not find any related article that explains how to achieve this.

Best Answer

Here is some code that can do what you want.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the documents; here the text is assumed to sit in the fourth
# comma-separated column of each line.
texts = []
with open('filename.csv', 'r') as f:
    for line in f:
        texts.append(line.split(',')[3])

# Build the document-term matrix of tf-idf weights.
matrix = TfidfVectorizer().fit_transform(texts)

# Sum the tf-idf weights of each row (document) and flatten the result
# into a 1-D array with one total per document.
total_tf_idf = np.asarray(matrix.sum(axis=1)).ravel()

# Keep only the documents whose total tf-idf exceeds the threshold.
threshold = 3
indexes_above_threshold = np.where(total_tf_idf > threshold)[0]
matrix_above_threshold = matrix[indexes_above_threshold, :]

The parts to focus on are the creation of total_tf_idf, which sums the tf-idf weights of each row; indexes_above_threshold, which collects the indexes of the documents you want to keep; and matrix_above_threshold, which is the final filtered matrix.
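
If you also want to tie the kept rows back to your document names and their total scores, roughly as in the table you sketched, a minimal follow-up could look like the sketch below. It assumes pandas is available and that doc_names is a list of document ids/filenames in the same order as texts (for example your dic["id"]); the doc_names used here are placeholders, not something produced by the code above.

import pandas as pd

# doc_names: hypothetical list of document ids/filenames, aligned with `texts`
doc_names = ["doc#{}".format(i + 1) for i in range(len(texts))]

summary = pd.DataFrame({
    "docName": doc_names,
    "tfidf": total_tf_idf,   # total tf-idf weight per document
})

# Documents whose total tf-idf exceeds the threshold, highest first.
kept = summary[summary["tfidf"] > threshold].sort_values("tfidf", ascending=False)
print(kept.head())

If you also need the top feature names per document, you would additionally have to keep the fitted vectorizer around (rather than calling fit_transform anonymously) and look up its feature names for the largest entries of each row.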

I hope this helps. Let me know if anything is unclear.