Solved – Understanding and interpreting the output of Spark’s TF-IDF implementation

machine-learning, spark-mllib, text-mining, tf-idf

I am currently trying to understand what the example code provided as part of Spark's TF-IDF implementation is doing.

Given the example code block (taken from Spark's GitHub repository):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// sess is an existing SparkSession
val sentenceData = sess.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
wordsData.show(false)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
featurizedData.show(false)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show(false)

And the output for wordsData:

+-----+-----------------------------------+------------------------------------------+
|label|sentence                           |words                                     |
+-----+-----------------------------------+------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+-----+-----------------------------------+------------------------------------------+

And the output for featurizedData:

+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|label|sentence                           |words                                     |rawFeatures                              |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(20,[0,5,9,17],[1.0,1.0,1.0,2.0])        |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+

And the output for rescaledData:

+-----+----------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                              |
+-----+----------------------------------------------------------------------------------------------------------------------+
|0.0  |(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906])                        |
|0.0  |(20,[2,7,9,13,15],[0.6931471805599453,0.6931471805599453,0.8630462173553426,0.28768207245178085,0.28768207245178085]) |
|1.0  |(20,[4,6,13,15,18],[0.6931471805599453,0.6931471805599453,0.28768207245178085,0.28768207245178085,0.6931471805599453])|
+-----+----------------------------------------------------------------------------------------------------------------------+

Why was 20 chosen as the number of features? And what do we mean by feature in this particular case?

What is the meaning of the first row value of column rawFeatures of featurizedData?

(20,[0,5,9,17],[1.0,1.0,1.0,2.0])

At a glance, I can see that the 20 comes from the number-of-features parameter, but I'm struggling to find the meaning of [0,5,9,17] and [1.0,1.0,1.0,2.0].

rescaledData seems to imply that some sort of scaling has been applied to the rawFeatures column on featurizedData, so I think I'm clear about that.

What conclusions and correlations can I establish between the vector pairs I'm receiving in the resulting dataframes from IDF and the words that were input?

I have tried to run a few example word sets from other websites with tutorials, but the resulting numbers they get in R or Python are totally different from what I'm seeing in Spark.

I ask this question since I am trying to understand this data science topic and overcome my ignorance.

Any help is appreciated!

Best Answer

For anyone interested in the explanation, I finally figured it out.

The code can be split into two general stages: hashing-TF counting and IDF calculation.

For the hashing-TF stage, the example sets 20 as the length of the feature vector that will hold the term counts, using Spark's "hashing trick" (not liking the name :P) with MurmurHash3_x86_32 as the default string hash implementation. Internally this builds a map of term frequencies: each term is hashed with murmur3, and that hash is turned into a non-negative bucket index by taking it modulo the feature-vector length. As we hash and add the terms (term → bucket index → running count for that bucket), we get:

hi -> 0 -> 1.0
i -> 9 -> 1.0
heard -> 17 -> 1.0
about -> 17 -> 2.0
spark -> 5 -> 1.0

i -> 9 -> 1.0
wish -> 15 -> 1.0
java -> 7 -> 1.0
could -> 13 -> 1.0
use -> 9 -> 2.0
case -> 2 -> 1.0
classes -> 9 -> 3.0

logistic -> 4 -> 1.0
regression -> 15 -> 1.0
models -> 13 -> 1.0
are -> 18 -> 1.0
neat -> 6 -> 1.0
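The per-document counting above can be sketched in a few lines of Python. The bucket indices are taken straight from Spark's output shown earlier (the real HashingTF derives them as murmur3(term) mod numFeatures; this sketch does not reimplement the hash):

```python
from collections import Counter

# Bucket indices observed in the Spark output above, used in place of
# the real MurmurHash3_x86_32-based index computation.
BUCKET = {"hi": 0, "i": 9, "heard": 17, "about": 17, "spark": 5}

def hashing_tf(words, num_features=20):
    """Count terms per bucket and return a (size, indices, values) sparse vector."""
    counts = Counter(BUCKET[w] for w in words)
    indices = sorted(counts)
    return num_features, indices, [float(counts[i]) for i in indices]

print(hashing_tf(["hi", "i", "heard", "about", "spark"]))
# → (20, [0, 5, 9, 17], [1.0, 1.0, 1.0, 2.0])
```

This reproduces the first row of rawFeatures exactly: the collision of heard and about on bucket 17 is what produces the 2.0.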

Notice that there is a hash collision between the terms heard and about in the first document, and between i, use, and classes in the second. Choosing a longer feature vector makes these collisions less likely. This is confirmed by a similar answer to the same question on Stack Overflow (which leads me to believe I should have posted the question there instead).
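A rough birthday-bound calculation (a sketch, assuming the hash spreads terms uniformly; not Spark code) shows why a larger vector helps:

```python
def collision_free_prob(num_terms, num_buckets):
    # Probability that num_terms distinct terms all land in different
    # buckets, assuming a uniformly distributed hash.
    p = 1.0
    for i in range(num_terms):
        p *= 1 - i / num_buckets
    return p

print(round(collision_free_prob(5, 20), 3))       # 5 distinct terms, 20 buckets → 0.581
print(round(collision_free_prob(5, 2 ** 18), 6))  # HashingTF's default 2^18 buckets → 0.999962
```

With the example's tiny vector of 20, five distinct terms avoid any collision only about 58% of the time; with HashingTF's default of 2^18 features, collisions are essentially negligible at this scale.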

Now, I'll try to figure out what the IDF scaling numbers mean in the context of each document, and within the entire corpus.

Update 2

There's a nice, clear explanation about the TF-IDF concept here. Essentially:

TF-IDF is the product of TF and IDF. A high TF-IDF is obtained when the Term Frequency is high and the Document Frequency is low (i.e., the IDF is high).

With this in mind, it makes sense that hash #17 in document 1 has a TF-IDF of 1.3862: the input terms heard and about both map to bucket #17 (so the Term Frequency is high), and that bucket does not appear in documents 2 and 3 (so the Document Frequency is low).

The other interesting example is the set of terms that map to hash #9: i, use, and classes. In the first document, only the term i maps to bucket #9, so it has a low local Term Frequency and a mid-range Document Frequency (the bucket appears in two of the three documents), hence a local TF-IDF score of 0.2876.

In the second document, though, i together with use and classes maps to bucket #9, giving a high Term Frequency with the same mid-range Document Frequency, so it scores 0.8630 locally.
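These numbers line up with Spark ML's smoothed IDF formula, idf(t) = log((|D| + 1) / (DF(t, D) + 1)), and can be checked directly:

```python
import math

NUM_DOCS = 3  # the three sentences in the example

def idf(df):
    # Spark ML's smoothed IDF: log((|D| + 1) / (DF(t, D) + 1))
    return math.log((NUM_DOCS + 1) / (df + 1))

# Bucket #17 in document 1: tf = 2 (heard + about), bucket appears in 1 document
print(2 * idf(1))  # → 1.3862943611198906
# Bucket #9 in document 1: tf = 1 (i), bucket appears in 2 documents
print(1 * idf(2))  # → 0.28768207245178085
# Bucket #9 in document 2: tf = 3 (i + use + classes), same df = 2
print(3 * idf(2))  # → 0.8630462173553426
```

The three printed values match the corresponding entries of the features column in rescaledData exactly.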

This makes me believe that the initial choice of feature-vector length is really crucial: it prevents hash collisions and ensures that each TF-IDF value applies to a uniquely bucketed term.

Specifically, I mean the consequences for tracing a term back from a particular hash value in the resulting TF and TF-IDF vectors: as the example above shows, a hash collision assigns a single TF-IDF value to several distinct terms at once.

Hope this helps someone else. Thanks!
