Solved – How to combine two tfidf sparse vectors

scikit-learn, tf-idf

Say I have two document collections and have created tf-idf sparse vectors for each one using TfidfVectorizer. How could I combine them into one representation that resembles the tf-idf of the union of the two collections?

How should I approach this, given that the two collections will probably have different features?

Best Answer

Why don't you calculate them from scratch?

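An important part of the Vector Space Model is the dictionary. In your case you have two collections and therefore two dictionaries, which may or may not share elements. Either way, you need to merge the two dictionaries and then recalculate the TF-IDF weights for each of your documents. The IDF part of each weight depends on document frequencies over the whole collection, so weights computed on each collection separately do not carry over to the union.

Otherwise I don't see what meaning a merge would have, since the two sets of vectors live in spaces with different dimensions.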
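A minimal sketch of the from-scratch approach, assuming the raw texts of both collections are still available. The two toy collections and all variable names here are illustrative and not taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the two document collections.
collection_a = ["the cat sat on the mat", "dogs chase cats"]
collection_b = ["stock markets fell sharply", "the cat market is niche"]

# Fit a single vectorizer on the union, so that both collections share
# one dictionary and one set of IDF weights.
combined = collection_a + collection_b
vectorizer = TfidfVectorizer()
combined_matrix = vectorizer.fit_transform(combined)

# Rows keep the input order, so each collection can be sliced back out.
matrix_a = combined_matrix[:len(collection_a)]
matrix_b = combined_matrix[len(collection_a):]
```

Both slices now have the same columns, so rows from either collection can be compared directly.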

Related Solutions

Solved – Subset documents based on tfidf weights

Here is some code that can do what you want.

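import csv

from sklearn.feature_extraction.text import TfidfVectorizer

# Read the text from the fourth column of each CSV row.
with open('filename.csv', 'r', newline='') as f:
    texts = [row[3] for row in csv.reader(f)]

matrix = TfidfVectorizer().fit_transform(texts)
# Sum the TF-IDF weights of each document (one total per row).
total_tf_idf = matrix.sum(axis=1)

threshold = 3
# Row indexes of the documents whose total weight exceeds the threshold.
indexes_above_threshold = [i for i in range(matrix.shape[0]) if total_tf_idf[i, 0] > threshold]
matrix_above_threshold = matrix[indexes_above_threshold, :]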

The parts to focus on are the creation of total_tf_idf which uses the sum function, indexes_above_threshold which gets the indexes you want, and matrix_above_threshold which is the final matrix you want.
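For experimenting without a CSV file, here is a self-contained sketch of the same idea using NumPy instead of a list comprehension. The toy documents and the threshold of 2.0 are arbitrary choices for illustration. With TfidfVectorizer's default L2 normalization, a row's total is at most the square root of its number of distinct terms, so a threshold of 3 can only be exceeded by documents with more than nine distinct terms.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents with 1, 7, 2 and 10 distinct terms.
texts = [
    "apple",
    "apple banana cherry date elderberry fig grape",
    "banana cherry",
    "apple banana cherry date elderberry fig grape honeydew kiwi lemon",
]

matrix = TfidfVectorizer().fit_transform(texts)

# Flatten the per-row sums into a plain 1-D array.
row_totals = np.asarray(matrix.sum(axis=1)).ravel()

threshold = 2.0  # arbitrary value chosen for this toy data
kept_indexes = np.flatnonzero(row_totals > threshold)
matrix_above_threshold = matrix[kept_indexes]
```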

I hope this helps. Let me know if anything is unclear.

Solved – TFIDF for feature selection method for unlabeled text documents

Note that this has some overlap with an earlier, somewhat similar question (where I suggested grouping the words in the TF-IDF matrix by their covariance and selecting the most frequent word in each group as the best feature).

The typical approach is to take the top $n$ most frequent words (or some top fraction $x$), which you can do, as you suggest, after applying some form of TF-IDF scaling to the word frequencies. Spectral analysis and clustering (e.g., clustering word embeddings instead of TF-IDF values and then selecting the most central word in each cluster) have been suggested recently (2012-2016) to improve unsupervised word feature selection, but they are not very common and are far more complex to set up than a quick TF-IDF-ranked frequency filter.

As for measuring the "correct" choice of $n$ (or $x$): if all your work is unsupervised, you can only measure intrinsic correctness (cf. model perplexity). Otherwise, you need to evaluate your unsupervised results against some supervised task with a simple setup, as is common practice when evaluating word embeddings.
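As a rough sketch of such a TF-IDF-ranked filter: the code below scores each term by its mean TF-IDF weight over all documents and keeps the top $n$. Using the mean as the ranking score is an assumption made for illustration (the answer does not prescribe a particular score), and the documents and the value of n are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unlabeled documents.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
    "a quick brown fox is a clever fox",
    "dogs and foxes are both canids",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Score each term (column) by its mean TF-IDF weight over all documents.
mean_scores = np.asarray(matrix.mean(axis=0)).ravel()

n = 5  # number of features to keep; an arbitrary choice here
top_idx = np.argsort(mean_scores)[::-1][:n]

# Recover the term for each column index from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_terms = [terms[i] for i in top_idx]

# Keep only the selected columns.
reduced = matrix[:, top_idx]
```

If ranking by raw corpus frequency is enough, TfidfVectorizer's `max_features` parameter keeps only the most frequent terms without any extra code.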
