I've got a dataset that represents 1000 documents and all the words that appear in them. The rows represent the documents and the columns represent the words, so the value in cell $(i,j)$ is the number of times word $j$ occurs in document $i$. Now I have to find 'weights' for the words using the tf/idf method, but I don't know how to do this. Can someone please help me out?
Solved – Term frequency/inverse document frequency (TF/IDF): weighting
data mining, feature selection, r
Best Answer
Wikipedia has a good article on the topic, complete with formulas. The values in your matrix are the term frequencies. You just need to find the idf:
$$\mathrm{idf}_j = \log\left(\frac{\text{total documents}}{\text{number of docs with term } j}\right)$$
and multiply the two values. In R, you could do so as follows:
Here are the datasets:
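As a minimal sketch, assuming a small made-up count matrix (the toy values and the names `tf`, `doc_freq`, `idf`, and `tfidf` are illustrative, not taken from the original data), the idf can be computed per column and multiplied into the counts. Printing `tf` and `tfidf` shows the raw and the weighted datasets:

```r
# Hypothetical toy data: 3 documents (rows) x 4 words (columns) of raw counts.
tf <- matrix(c(2, 0, 1, 0,
               0, 3, 1, 0,
               1, 1, 1, 4),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("doc", 1:3),
                             c("apple", "banana", "the", "zebra")))

# Number of documents in which each word occurs at least once.
doc_freq <- colSums(tf > 0)

# idf = log(total documents / number of docs with the term)
idf <- log(nrow(tf) / doc_freq)

# Multiply every column j of the count matrix by idf[j].
tfidf <- sweep(tf, 2, idf, `*`)

print(tf)               # raw term frequencies
print(round(tfidf, 3))  # tf-idf weights
```

Note that a word occurring in every document gets an idf of 0, so its whole column is weighted to 0. A column that is all zeros would give a document frequency of 0 and therefore an infinite idf, so it is probably worth dropping such columns before running this on real data.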
You can also look at the idf of each term:
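A short sketch of that inspection step, again on a hypothetical toy count matrix (the values and the name `tf` are assumptions for illustration only). Sorting the idf vector puts the rarest words, which receive the largest weights, first:

```r
# Hypothetical toy counts: 3 documents (rows) x 4 words (columns).
tf <- matrix(c(2, 0, 1, 0,
               0, 3, 1, 0,
               1, 1, 1, 4),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("doc", 1:3),
                             c("apple", "banana", "the", "zebra")))

# idf per term: log(total documents / number of docs with the term)
idf <- log(nrow(tf) / colSums(tf > 0))

# Named vector of idf values, rarest terms first.
idf_sorted <- sort(idf, decreasing = TRUE)
print(round(idf_sorted, 3))
```

Here the word that occurs in only one document ends up at the top, and the word that occurs in all three documents ends up at the bottom with an idf of 0.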