I am following the tf-idf method described in this paper: Measuring, Predicting and Visualizing Short-Term Change in Word Representation and Usage in VKontakte Social Network.
In the paper linked above (see equation 2), they compute only a single tf-idf value for each word (w) in each week (t).
For example, consider the graph below, which I took from the paper.
It shows how the tf-idf value of the word putin changed over the weeks, i.e. there is one tf-idf value for the word putin in each week.
I would like to implement the tf-idf approach that they suggest. In other words, I would like to calculate a single tf-idf value for each word in each time period. However, I am struggling to find a way to implement this in Python.
Currently I am using the sklearn library to implement this. However, in the tutorials that I follow, a word can have multiple tf-idf values within a time period t, one per document it appears in. For example, consider the documents below in time frame t.
The tf-idf values we get are as follows.
For example, the word "method" has 3 tf-idf scores according to my sklearn implementation. Hence, I am not sure if I am following the paper correctly.
My preferred language is Python.
I am happy to provide more details if needed.
Best Answer
The modeling strategy suggested in the paper refers to the temporal representation (both frequency and context) of words.
From what I understand, they attempt to learn the changes in these representations across time.
One such representation is based on the tf-idf method.
In the mentioned equation, the parameter $t$ indicates a week's corpus.
This means that each word will have $n$ tf-idf representations, one for each of the $n$ weeks relevant to the modeling.
One way of implementing this is to fit a new tf-idf transformer for each week and keep each (word, week) representation in a dictionary.
Then it is possible to view the changes in each word's representation across time.
EDIT:
The Word Usage Dynamics statistic is computed over all posts (documents) in a week and not per document, meaning each word should have only one value per week. From what I gathered, there is no straightforward implementation for this in sklearn, but there may be one in NLTK/Gensim.
Still, it seems quite simple to implement on your own:
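A minimal sketch of such a hand-rolled version, using only the standard library. It rests on assumptions that should be checked against equation 2 in the paper: term frequency is taken as the word's share of all tokens in the week's posts, and idf as the log of (posts in the week) / (posts in the week containing the word). The function name `weekly_tfidf`, the `posts_by_week` layout, the whitespace tokenizer and the sample posts are all made up for illustration.

```python
import math
from collections import Counter


def weekly_tfidf(posts_by_week):
    """Return {week: {word: score}}, i.e. one score per word per week.

    Assumed definition (verify against equation 2 in the paper):
      tf(w, t)  = occurrences of w in week t / total tokens in week t
      idf(w, t) = log(posts in week t / posts in week t containing w)
    """
    scores = {}
    for week, posts in posts_by_week.items():
        # Naive whitespace tokenizer; swap in a proper one for real data.
        tokenized = [post.lower().split() for post in posts]

        # Word counts pooled over every post of the week.
        term_counts = Counter(tok for toks in tokenized for tok in toks)
        # Number of posts of the week in which each word appears.
        doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

        total_tokens = sum(term_counts.values())
        n_posts = len(tokenized)

        scores[week] = {
            word: (count / total_tokens) * math.log(n_posts / doc_freq[word])
            for word, count in term_counts.items()
        }
    return scores


# Hypothetical toy data: week label -> list of posts.
posts_by_week = {
    "week1": ["putin visits moscow", "weather in moscow", "moscow traffic"],
    "week2": ["putin speech today", "putin on tv", "weather today"],
}

scores = weekly_tfidf(posts_by_week)

# One value per week for a given word, ready to plot over time.
putin_series = {week: s.get("putin", 0.0) for week, s in scores.items()}
print(putin_series)
```

Because the counts are pooled over the whole week before scoring, each word ends up with exactly one value per week, which is what the plot for putin needs. If the paper's equation uses a different tf or idf (for example a log-scaled count), only the expression inside the dictionary comprehension has to change.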