Solved – How to calculate tf-idf for a single term

data mining, mathematical-statistics, python, scikit-learn, tf-idf

I am following the tf-idf method described in this paper: Measuring, Predicting and Visualizing Short-Term Change in Word Representation and Usage in VKontakte Social Network.

In the paper I have linked above (see equation 2 in the paper), they get only a single tf-idf value for each word (w) in each week (t), as follows.

[Equation 2 from the paper: a single tf-idf value per word w per week t, of the form tf-idf(w, t) = log(freq(w, t)) · log(N_t / df(w, t)), where freq(w, t) is the total frequency of w in week t's posts, N_t is the number of posts in week t, and df(w, t) is the number of those posts containing w.]

For example, consider the graph below, taken from the paper above.

[Figure from the paper: the weekly tf-idf value of the word putin, plotted over time.]

It shows how the tf-idf value of the word putin changed over the weeks, i.e. one tf-idf value for the word putin in each week.

I would like to implement the tf-idf approach they suggest. In other words, I would like to calculate a single tf-idf value for a word in each time period. However, I am struggling to find a way to implement this in Python.

Currently I am using the sklearn library to implement this. However, in the tutorials I follow, a word can have multiple tf-idf values within a time period t. For example, consider the documents below in time frame t.

[Image: the example documents in time frame t.]

The tf-idf values we get are as follows.

[Image: the resulting tf-idf matrix, with one value per word per document.]

For example, consider the word "method": it has 3 tf-idf scores according to my sklearn implementation. Hence, I am not sure whether I am following the paper correctly.
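To make the issue concrete, here is a minimal version of my current approach (the documents are made-up stand-ins for the ones in the image above). It shows that TfidfVectorizer produces one value per (word, document) pair:

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the time frame t shown above.
docs_t = [
    'a new method for word embeddings',
    'this method outperforms the baseline method',
    'we evaluate the method on social network data',
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs_t)

# "method" gets a separate tf-idf score in each document it appears in,
# so it has 3 scores here instead of a single per-period value.
col = vectorizer.vocabulary_['method']
print(tfidf_matrix[:, col].toarray().ravel())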

My preferred language is Python.

I am happy to provide more details if needed.

Best Answer

The modeling strategy suggested in the paper refers to temporal representations (both frequency and context) of words. From what I understand, they attempt to learn the changes in these representations across time. One such representation is based on the tf-idf method. In the mentioned equation, the parameter $t$ indicates the week's corpus. This means that each word will have $n$ tf-idf representations, one for each of the $n$ weeks relevant to the modeling.

One way to implement this is to fit a new tf-idf transformer for each week and keep each (word, week) representation in a dictionary.

Then it is possible to view the changes in each word's representation across time, as in the sketch below.
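For example, a minimal sketch of that idea, where the weekly_corpora mapping from week to posts is hypothetical input:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: a mapping from week number to that week's posts.
weekly_corpora = {
    1: ['first week post about putin', 'another post from this week'],
    2: ['second week post', 'putin mentioned again in this post'],
}

# Fit a fresh transformer per week and keep each (word, week) representation.
representations = {}
for week, docs in weekly_corpora.items():
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(docs)
    for word, col in vec.vocabulary_.items():
        # A word's representation for this week: its tf-idf values
        # across that week's documents.
        representations[(word, week)] = tfidf[:, col].toarray().ravel()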

EDIT:
The Word Usage Dynamics statistic being used is computed over all posts (documents), not per document, meaning each word should have only one value per week. From what I gathered, there is no straightforward implementation for this in sklearn, but there may be one in NLTK/Gensim.

Still, it seems quite simple to implement on your own:

import math

from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                              _document_frequency)

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]

count_vec = CountVectorizer(binary=False)
count_df = count_vec.fit_transform(corpus)
posts_cnt = len(corpus)

# Standard sklearn tf-idf, for contrast: one value per (word, document) pair.
transformer = TfidfTransformer(use_idf=True, smooth_idf=False)
X1 = transformer.fit_transform(count_df)

# Calculating one tf-idf value per word over all documents:
# log(total term frequency) * log(number of posts / document frequency),
# using sklearn's private _document_frequency helper on the raw counts.
term_freqs = count_df.sum(axis=0).tolist()[0]
doc_freqs = _document_frequency(count_df)
vals = [math.log(tf) * math.log(posts_cnt / float(df))
        for tf, df in zip(term_freqs, doc_freqs)]

# Mapping the tf-idf values back to the original words.
word_tfidf = {word: vals[col] for word, col in count_vec.vocabulary_.items()}
print(word_tfidf)
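To recover the per-week series from the question (one value per word per week), the same computation can be wrapped in a loop over weeks. A sketch, where the weekly_corpora input and the weekly_tfidf helper are mine rather than from the paper:

import math

from sklearn.feature_extraction.text import CountVectorizer, _document_frequency

def weekly_tfidf(docs):
    """One tf-idf value per word over a whole week's posts."""
    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(docs)
    n_posts = len(docs)
    tfs = counts.sum(axis=0).tolist()[0]  # total term frequencies
    dfs = _document_frequency(counts)     # documents containing each term
    return {word: math.log(tfs[col]) * math.log(n_posts / float(dfs[col]))
            for word, col in count_vec.vocabulary_.items()}

# Hypothetical per-week corpora; trace how one word's value changes over weeks.
weekly_corpora = {
    1: ['putin visits the region', 'more news about putin today',
        'a post about the weather'],
    2: ['putin gives a speech and putin answers questions',
        'crowds react to the speech', 'sports results from yesterday'],
}
putin_series = {week: weekly_tfidf(docs).get('putin', 0.0)
                for week, docs in weekly_corpora.items()}
print(putin_series)  # one tf-idf value for "putin" per week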