I am currently trying to understand what the example code provided as part of Spark's TF-IDF implementation is doing.
Given the example code block (taken from Spark's GitHub repository):
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = sess.createDataFrame(Seq(
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
wordsData.show(false)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
featurizedData.show(false)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show(false)
And the output for wordsData:
+-----+-----------------------------------+------------------------------------------+
|label|sentence |words |
+-----+-----------------------------------+------------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+-----+-----------------------------------+------------------------------------------+
And the output for featurizedData:
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|label|sentence |words |rawFeatures |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|0.0 |Hi I heard about Spark |[hi, i, heard, about, spark] |(20,[0,5,9,17],[1.0,1.0,1.0,2.0]) |
|0.0 |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |
|1.0 |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
And the output for rescaledData:
+-----+----------------------------------------------------------------------------------------------------------------------+
|label|features |
+-----+----------------------------------------------------------------------------------------------------------------------+
|0.0 |(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906]) |
|0.0 |(20,[2,7,9,13,15],[0.6931471805599453,0.6931471805599453,0.8630462173553426,0.28768207245178085,0.28768207245178085]) |
|1.0 |(20,[4,6,13,15,18],[0.6931471805599453,0.6931471805599453,0.28768207245178085,0.28768207245178085,0.6931471805599453])|
+-----+----------------------------------------------------------------------------------------------------------------------+
Why was 20 chosen as the number of features? And what do we mean by "feature" in this particular case?
What is the meaning of the first row value of the rawFeatures column of featurizedData?
(20,[0,5,9,17],[1.0,1.0,1.0,2.0])
From plain eyesight, I can see that the 20 comes from the number-of-features parameter, but I'm struggling to find the meaning of [0,5,9,17] and [1.0,1.0,1.0,2.0].
rescaledData seems to imply that some sort of scaling has been applied to the rawFeatures column of featurizedData, so I think I'm clear about that.
What conclusions and correlations can I establish between the vector pairs I'm receiving in the resulting dataframes from IDF and the words that were input?
I have tried to run a few example word sets from other websites with tutorials, but the resulting numbers they get in R or Python are totally different from what I'm seeing in Spark.
I'm asking because I'm trying to understand this data science topic and fill the gaps in my knowledge.
Any help is appreciated!
Best Answer
For anyone interested in the explanation, I finally figured it out.
The code can be split into two general stages: hashing TF counts and IDF calculation.
For hashing TF, the example sets 20 as the length of the feature vector that will store term counts using Spark's "hashing trick" (not liking the name :P), with MurmurHash3_x86_32 as the default string hash implementation. This essentially computes a hash map of term frequencies, where each term is first hashed via murmur3 and then converted into a non-negative bucket index by taking the hash result modulo the length of the feature vector. The sparse vectors in rawFeatures read as (vectorLength, [bucketIndices], [counts]): for the first document, (20,[0,5,9,17],[1.0,1.0,1.0,2.0]) means buckets 0, 5, 9, and 17 received terms, with bucket 17 receiving two.

Notice that there is a hash collision between the terms heard and about in the first document, and between i, use, and classes in the second document. This means that if we select a longer feature vector, these collisions become less likely to occur. This is evidenced by a similar answer to this same question on Stack Overflow (which leads me to believe I should have posted the question there instead).

Now, I'll try to figure out what the IDF scaling numbers mean in the context of each document, and within the entire corpus.
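The bucketing step can be sketched in plain Scala. This is a hypothetical illustration: Spark's HashingTF hashes the term's UTF-8 bytes with its own seeded MurmurHash3_x86_32, so the bucket indices produced by scala.util.hashing.MurmurHash3 below will not match the example's output exactly, but the mechanics are the same.

```scala
import scala.util.hashing.MurmurHash3

// Hash each term, fold the hash into a non-negative bucket index with a
// modulo, and count how many terms land in each bucket. Colliding terms
// simply add up in the same bucket.
def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
  terms
    .map(t => ((MurmurHash3.stringHash(t) % numFeatures) + numFeatures) % numFeatures)
    .groupBy(identity)
    .map { case (bucket, hits) => (bucket, hits.size.toDouble) }

val tf = termFrequencies(Seq("hi", "i", "heard", "about", "spark"), 20)
// Whatever the collisions, the bucket counts always sum to the term count.
assert(tf.values.sum == 5.0)
```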
Update 2
There's a nice, clear explanation of the TF-IDF concept here. Essentially: a term's weight is its frequency within a document (TF) multiplied by a measure of how rare it is across the whole corpus (IDF), so terms that are frequent locally but rare globally score highest.
With this in mind, it makes sense to see that hash #17 in document 1 has a TF-IDF of 1.3862, since the input terms heard and about map to the same hashing bucket, #17 (so the term frequency is high), and that bucket does not appear in documents 2 and 3 (so the document frequency is low).

The other interesting example is the terms that map to hash #9: i, use, and classes. In the first document, only the term i maps to hash #9, meaning it has a low local term frequency and a middle document frequency (the bucket appears in two of the three documents), hence a local TF-IDF score of 0.2876.

In the second document, though, that same term along with the terms use and classes maps to hash #9, giving a high term frequency and the same middle document frequency, thus scoring locally as 0.8630.
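These numbers can be reproduced directly from the IDF formula given in the Spark MLlib documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents with a nonzero count in bucket t; the stored value is tf * idf. A quick check in plain Scala:

```scala
// Spark ML's IDF formula: idf(t) = log((m + 1) / (df(t) + 1)),
// where m = number of documents and df(t) = number of documents whose
// vector has a nonzero count in bucket t. The final value is tf * idf.
val m = 3.0 // three sentences in the example corpus

def tfidf(tf: Double, df: Double): Double = tf * math.log((m + 1.0) / (df + 1.0))

val bucket17Doc1 = tfidf(tf = 2.0, df = 1.0) // heard + about, bucket only in doc 1
val bucket9Doc1  = tfidf(tf = 1.0, df = 2.0) // i alone; bucket seen in docs 1 and 2
val bucket9Doc2  = tfidf(tf = 3.0, df = 2.0) // i + use + classes
// bucket17Doc1 ≈ 1.3862943611198906
// bucket9Doc1  ≈ 0.28768207245178085
// bucket9Doc2  ≈ 0.8630462173553426
```

These match the features column of rescaledData above exactly.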
This makes me believe that the initial choice of feature vector length is really crucial for preventing hash collisions and for producing TF-IDF values that apply to uniquely bucketed terms. Specifically, I mean the repercussions of tracing a particular hash value in the generated TF and TF-IDF vectors back to a term: as the example above illustrates, a hash collision means a single TF-IDF value is assigned to a bucket shared by several different terms.
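To put a rough number on that intuition, here is a back-of-the-envelope birthday-problem estimate (my own sketch, not something from the Spark example): hashing k distinct terms uniformly into n buckets yields about k(k-1)/(2n) expected colliding pairs.

```scala
// Expected number of colliding pairs when k distinct terms are hashed
// uniformly into n buckets (linearity of expectation over term pairs).
def expectedCollisions(k: Int, n: Int): Double = k * (k - 1) / (2.0 * n)

// The example corpus has 16 distinct terms squeezed into 20 buckets:
val toy = expectedCollisions(16, 20)          // 6.0 expected colliding pairs
// With HashingTF's default numFeatures of 2^18 = 262144:
val default = expectedCollisions(16, 1 << 18) // ≈ 0.00046
```

This is why the example's collisions are unsurprising at numFeatures = 20 and would be vanishingly rare at the default vector length.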
Hope this helps someone else. Thanks!