Exercise 1.3 from Mining of Massive Data Sets book

data analysis, data mining

Hello, I have a question about an exercise in the Mining of Massive Datasets book:
http://infolab.stanford.edu/~ullman/mmds/ch1.pdf
It is exercise 1.3.2 on page 15.

My solution is as follows:
since there are $10$ million documents and the word occurs in $320$ of them,
Inverse Document Frequency = $\log(10^{7}/320)$.

Now as per question…

case a) if the word appears once, then $TF=1/15$ (since $15$ is given as the maximum number of occurrences of any word in the document)

case b) $TF = 5/15$, since the word appears $5$ times (with the same maximum of $15$ occurrences)

so for case a) $TF.IDF$ score $= \log(10^{7}/320)*(1/15)$

and for case b) $TF.IDF$ score $= \log(10^{7}/320)*(5/15)$

Is this solution correct? I just want to understand if I have understood the concept correctly or not.

Best Answer

You're on the right path...according to the definition of $IDF$, $IDF_i=\log_2 (N/n_i)$, so your answers should be

Case A: $TF.IDF \text { score} = \log_2 (10^{7}/320) * (1/15) = \log_2 (31250)/15 \approx 0.995$

Case B: $TF.IDF \text { score} = \log_2 (10^{7}/320) * (5/15) = \log_2 (31250)/3 \approx 4.98$
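
If you want to check the numbers, here is a minimal Python sketch. It assumes the definitions used here: $TF$ is the word's count divided by the count of the most frequent word in the document, and $IDF_i=\log_2 (N/n_i)$. The function name `tf_idf` and its parameter names are my own, chosen only for illustration.

```python
import math

def tf_idf(count, max_count, n_docs, docs_with_term):
    """TF.IDF with TF normalized by the most frequent word in the document."""
    tf = count / max_count                     # term frequency in [0, 1]
    idf = math.log2(n_docs / docs_with_term)   # base-2 inverse document frequency
    return tf * idf

N = 10_000_000   # documents in the repository
n_w = 320        # documents containing the word
max_occ = 15     # occurrences of the most frequent word in the document

score_a = tf_idf(1, max_occ, N, n_w)
score_b = tf_idf(5, max_occ, N, n_w)
print(f"{score_a:.3f}")  # 0.995
print(f"{score_b:.3f}")  # 4.977
```

The $TF$ factor multiplies the logarithm from outside, so appearing five times instead of once scales the score by exactly five.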