Solved – How to calculate tf-idf for a single term

data mining, mathematical-statistics, python, scikit-learn, tf-idf

I am following the tf-idf method described in this paper: Measuring, Predicting and Visualizing Short-Term Change in Word Representation and Usage in VKontakte Social Network.

In the paper I have linked above (see equation 2 in the paper), they get only a single tf-idf value for each word (w) in each week (t), as follows.

[Equation 2 from the paper: a single tf-idf value per word w per week t, of the form tf-idf(w, t) = log(freq(w, t)) · log(N_t / df(w, t)), where freq(w, t) is the total frequency of w in week t's posts, N_t is the number of posts in week t, and df(w, t) is the number of those posts containing w.]

For example, consider the graph below, taken from the paper above.

[Figure from the paper: the weekly tf-idf value of the word putin, plotted over time.]

It shows how the tf-idf value of the word putin changed over the weeks, i.e. one tf-idf value for the word putin in each week.

I would like to implement the tf-idf approach they suggest. In other words, I would like to calculate a single tf-idf value for a word in each time period. However, I am struggling to find a way to implement this in Python.

Currently I am using the sklearn library to implement this. However, in the tutorials I follow, a word can have multiple tf-idf values within a time period t. For example, consider the documents below in time frame t.

[Image: the example documents in time frame t.]

The tf-idf values we get are as follows.

[Image: the resulting tf-idf matrix, with one value per word per document.]

For example, consider the word "method": it has 3 tf-idf scores according to my sklearn implementation. Hence, I am not sure whether I am following the paper correctly.
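To make the issue concrete, here is a minimal version of my current approach (the documents are made-up stand-ins for the ones in the image above). It shows that TfidfVectorizer produces one value per (word, document) pair:

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the time frame t shown above.
docs_t = [
    'a new method for word embeddings',
    'this method outperforms the baseline method',
    'we evaluate the method on social network data',
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs_t)

# "method" gets a separate tf-idf score in each document it appears in,
# so it has 3 scores here instead of a single per-period value.
col = vectorizer.vocabulary_['method']
print(tfidf_matrix[:, col].toarray().ravel())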

My preferred language is Python.

I am happy to provide more details if needed.

Best Answer

The modeling strategy suggested in the paper refers to temporal representations (both frequency and context) of words. From what I understand, they attempt to learn the changes in these representations across time. One such representation is based on the tf-idf method. In the mentioned equation, the parameter $t$ indicates the week's corpus. This means that each word will have $n$ tf-idf representations, one for each of the $n$ weeks relevant to the modeling.

One way to implement this is to fit a new tf-idf transformer for each week and keep each (word, week) representation in a dictionary.

Then it is possible to view the changes in each word's representation across time, as in the sketch below.
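For example, a minimal sketch of that idea, where the weekly_corpora mapping from week to posts is hypothetical input:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: a mapping from week number to that week's posts.
weekly_corpora = {
    1: ['first week post about putin', 'another post from this week'],
    2: ['second week post', 'putin mentioned again in this post'],
}

# Fit a fresh transformer per week and keep each (word, week) representation.
representations = {}
for week, docs in weekly_corpora.items():
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(docs)
    for word, col in vec.vocabulary_.items():
        # A word's representation for this week: its tf-idf values
        # across that week's documents.
        representations[(word, week)] = tfidf[:, col].toarray().ravel()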

EDIT:
The Word Usage Dynamics statistic being used is computed over all posts (documents), not per document, meaning each word should have only one value per week. From what I gathered, there is no straightforward implementation for this in sklearn, but there may be one in NLTK/Gensim.

Still, it seems quite simple to implement on your own:

import math

from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                              _document_frequency)

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]

count_vec = CountVectorizer(binary=False)
count_df = count_vec.fit_transform(corpus)
posts_cnt = len(corpus)

# Standard sklearn tf-idf, for contrast: one value per (word, document) pair.
transformer = TfidfTransformer(use_idf=True, smooth_idf=False)
X1 = transformer.fit_transform(count_df)

# Calculating one tf-idf value per word over all documents:
# log(total term frequency) * log(number of posts / document frequency),
# using sklearn's private _document_frequency helper on the raw counts.
term_freqs = count_df.sum(axis=0).tolist()[0]
doc_freqs = _document_frequency(count_df)
vals = [math.log(tf) * math.log(posts_cnt / float(df))
        for tf, df in zip(term_freqs, doc_freqs)]

# Mapping the tf-idf values back to the original words.
word_tfidf = {word: vals[col] for word, col in count_vec.vocabulary_.items()}
print(word_tfidf)
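To recover the per-week series from the question (one value per word per week), the same computation can be wrapped in a loop over weeks. A sketch, where the weekly_corpora input and the weekly_tfidf helper are mine rather than from the paper:

import math

from sklearn.feature_extraction.text import CountVectorizer, _document_frequency

def weekly_tfidf(docs):
    """One tf-idf value per word over a whole week's posts."""
    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(docs)
    n_posts = len(docs)
    tfs = counts.sum(axis=0).tolist()[0]  # total term frequencies
    dfs = _document_frequency(counts)     # documents containing each term
    return {word: math.log(tfs[col]) * math.log(n_posts / float(dfs[col]))
            for word, col in count_vec.vocabulary_.items()}

# Hypothetical per-week corpora; trace how one word's value changes over weeks.
weekly_corpora = {
    1: ['putin visits the region', 'more news about putin today',
        'a post about the weather'],
    2: ['putin gives a speech and putin answers questions',
        'crowds react to the speech', 'sports results from yesterday'],
}
putin_series = {week: weekly_tfidf(docs).get('putin', 0.0)
                for week, docs in weekly_corpora.items()}
print(putin_series)  # one tf-idf value for "putin" per week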