I am currently trying to understand what the example code provided as part of Spark's TF-IDF implementation is doing.
Given the example code block (taken from Spark's GitHub repository):
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = sess.createDataFrame(Seq(
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
wordsData.show(false)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
featurizedData.show(false)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show(false)
And the output for wordsData:
+-----+-----------------------------------+------------------------------------------+
|label|sentence |words |
+-----+-----------------------------------+------------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+-----+-----------------------------------+------------------------------------------+
And the output for featurizedData:
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|label|sentence |words |rawFeatures |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |(20,[0,5,9,17],[1.0,1.0,1.0,2.0]) |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
And the output for rescaledData:
+-----+----------------------------------------------------------------------------------------------------------------------+
|label|features |
+-----+----------------------------------------------------------------------------------------------------------------------+
|0.0 |(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906]) |
|0.0 |(20,[2,7,9,13,15],[0.6931471805599453,0.6931471805599453,0.8630462173553426,0.28768207245178085,0.28768207245178085]) |
|1.0 |(20,[4,6,13,15,18],[0.6931471805599453,0.6931471805599453,0.28768207245178085,0.28768207245178085,0.6931471805599453])|
+-----+----------------------------------------------------------------------------------------------------------------------+
Why was 20 chosen as the number of features? And what do we mean by "feature" in this particular case?
What is the meaning of the first row value of the rawFeatures column of featurizedData?
(20,[0,5,9,17],[1.0,1.0,1.0,2.0])
From plain eyesight, I can see that the 20 comes from the number-of-features parameter, but I'm struggling to find the meaning of [0,5,9,17] and [1.0,1.0,1.0,2.0].
rescaledData seems to imply that some sort of scaling has been applied to the rawFeatures column of featurizedData, so I think I'm clear about that.
What conclusions and correlations can I establish between the vector pairs I'm receiving in the resulting dataframes from IDF and the words that were input?
I have tried to run a few example word sets from other websites with tutorials, but the resulting numbers they get in R or Python are totally different from what I'm seeing in Spark.
I'm asking because I'm trying to understand this data science topic and fill the gaps in my knowledge.
Any help is appreciated!
Best Answer
For anyone interested in the explanation, I finally figured it out.
The code can be split into two general stages: hashing TF counts and IDF calculation.
For hashing TF, the example sets 20 as the length of the feature vector that will store term counts using Spark's "hashing trick" (not liking the name :P), with MurmurHash3_x86_32 as the default string hash implementation. This essentially computes a hash map of term frequencies, where each term is first hashed via murmur3 and then converted into a non-negative bucket index by taking the hash result modulo the length of the feature vector. The sparse vectors in rawFeatures read as (vectorLength, [bucketIndices], [counts]): for the first document, (20,[0,5,9,17],[1.0,1.0,1.0,2.0]) means buckets 0, 5, 9, and 17 received terms, with bucket 17 receiving two.

Notice that there is a hash collision between the terms heard and about in the first document, and between i, use, and classes in the second document. This means that if we select a longer feature vector, these collisions become less likely to occur. This is evidenced by a similar answer to this same question on Stack Overflow (which leads me to believe I should have posted the question there instead).

Now, I'll try to figure out what the IDF scaling numbers mean in the context of each document, and within the entire corpus.
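The bucketing step can be sketched in plain Scala. This is a hypothetical illustration: Spark's HashingTF hashes the term's UTF-8 bytes with its own seeded MurmurHash3_x86_32, so the bucket indices produced by scala.util.hashing.MurmurHash3 below will not match the example's output exactly, but the mechanics are the same.

```scala
import scala.util.hashing.MurmurHash3

// Hash each term, fold the hash into a non-negative bucket index with a
// modulo, and count how many terms land in each bucket. Colliding terms
// simply add up in the same bucket.
def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
  terms
    .map(t => ((MurmurHash3.stringHash(t) % numFeatures) + numFeatures) % numFeatures)
    .groupBy(identity)
    .map { case (bucket, hits) => (bucket, hits.size.toDouble) }

val tf = termFrequencies(Seq("hi", "i", "heard", "about", "spark"), 20)
// Whatever the collisions, the bucket counts always sum to the term count.
assert(tf.values.sum == 5.0)
```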
Update 2
There's a nice, clear explanation of the TF-IDF concept here. Essentially: a term's weight is its frequency within a document (TF) multiplied by a measure of how rare it is across the whole corpus (IDF), so terms that are frequent locally but rare globally score highest.
With this in mind, it makes sense to see that hash #17 in document 1 has a TF-IDF of 1.3862, since the input terms heard and about map to the same hashing bucket, #17 (so the term frequency is high), and that bucket does not appear in documents 2 and 3 (so the document frequency is low).

The other interesting example is the terms that map to hash #9: i, use, and classes. In the first document, only the term i maps to hash #9, meaning it has a low local term frequency and a middle document frequency (the bucket appears in two of the three documents), hence a local TF-IDF score of 0.2876.

In the second document, though, that same term along with the terms use and classes maps to hash #9, giving a high term frequency and the same middle document frequency, thus scoring locally as 0.8630.
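These numbers can be reproduced directly from the IDF formula given in the Spark MLlib documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents with a nonzero count in bucket t; the stored value is tf * idf. A quick check in plain Scala:

```scala
// Spark ML's IDF formula: idf(t) = log((m + 1) / (df(t) + 1)),
// where m = number of documents and df(t) = number of documents whose
// vector has a nonzero count in bucket t. The final value is tf * idf.
val m = 3.0 // three sentences in the example corpus

def tfidf(tf: Double, df: Double): Double = tf * math.log((m + 1.0) / (df + 1.0))

val bucket17Doc1 = tfidf(tf = 2.0, df = 1.0) // heard + about, bucket only in doc 1
val bucket9Doc1  = tfidf(tf = 1.0, df = 2.0) // i alone; bucket seen in docs 1 and 2
val bucket9Doc2  = tfidf(tf = 3.0, df = 2.0) // i + use + classes
// bucket17Doc1 ≈ 1.3862943611198906
// bucket9Doc1  ≈ 0.28768207245178085
// bucket9Doc2  ≈ 0.8630462173553426
```

These match the features column of rescaledData above exactly.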
This makes me believe that the initial choice of feature vector length is really crucial for preventing hash collisions and for producing TF-IDF values that apply to uniquely bucketed terms. Specifically, I mean the repercussions of tracing a particular hash value in the generated TF and TF-IDF vectors back to a term: as the example above illustrates, a hash collision means a single TF-IDF value is assigned to a bucket shared by several different terms.
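To put a rough number on that intuition, here is a back-of-the-envelope birthday-problem estimate (my own sketch, not something from the Spark example): hashing k distinct terms uniformly into n buckets yields about k(k-1)/(2n) expected colliding pairs.

```scala
// Expected number of colliding pairs when k distinct terms are hashed
// uniformly into n buckets (linearity of expectation over term pairs).
def expectedCollisions(k: Int, n: Int): Double = k * (k - 1) / (2.0 * n)

// The example corpus has 16 distinct terms squeezed into 20 buckets:
val toy = expectedCollisions(16, 20)          // 6.0 expected colliding pairs
// With HashingTF's default numFeatures of 2^18 = 262144:
val default = expectedCollisions(16, 1 << 18) // ≈ 0.00046
```

This is why the example's collisions are unsurprising at numFeatures = 20 and would be vanishingly rare at the default vector length.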
Hope this helps someone else. Thanks!