TF-IDF – Calculating TF-IDF on a Test Set After Training with Scikit-Learn

machine learning · python · scikit learn · tf-idf

I need to engineer features from TF-IDF values for a downstream classification task. I (think) I have a reasonable grasp of TF-IDF as described in the scikit-learn documentation, but I am unable to work out how the respective values are calculated when working from distinct train/test sets. Most online tutorials calculate TF-IDF values on the whole dataset, which to my eye results in data leakage, as the IDF values will surely be influenced by the extra documents present in the whole dataset relative to the training set alone.

Following the documentation above, I have written a working script that applies the fit_transform function to the training set and then applies the transform function to the test set. This gives me two transformed dataframes (one train, one test), but I cannot work out how the values in the test set are calculated.
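For reference, a minimal sketch of that workflow (the toy documents here are my own invention, not from my actual data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, purely for illustration
train_docs = ["the cat sat", "the dog sat", "the cat ran"]
test_docs = ["the cat ran fast"]

vectorizer = TfidfVectorizer()

# Learn vocabulary and document frequencies from the training set only
X_train = vectorizer.fit_transform(train_docs)

# Re-use the training vocabulary and statistics on the test set;
# terms unseen in training ("fast") are simply dropped
X_test = vectorizer.transform(test_docs)

print(X_train.shape)  # (3, 5): 3 docs, 5 training-set terms
print(X_test.shape)   # (1, 5): same columns as the training matrix
```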

My assumption for generating test set values is as follows, for each term t in the training set:

  • TF: the number of documents in the test set with term t, divided by the total number of documents in the test set.
  • IDF: the number of documents in the training set, divided by the number of documents containing term t in the training set.

Basically, I am unsure about what the transform function is doing, and from what sample it is accessing at a given time. Here, the documentation states:

Transform documents to document-term matrix.
Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).

Which leads me to think my current interpretation may not be far from correct, but I am having a hard time confirming.

Note: I am posting here rather than on Stack Overflow because I am fairly sure the script I have written works; I am just trying to understand how the values are calculated.

Best Answer

Term Frequency is not based on a corpus (except in setting the vocabulary, which is based on the training set): it is just the count of terms within a single document. But you are correct about the Inverse Document Frequency part: sklearn uses the statistics from the training set when transforming new data.
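You can reproduce a transformed test row by hand to confirm this. The sketch below assumes TfidfVectorizer's defaults (smooth_idf=True, norm='l2'), under which the IDF is ln((1 + n) / (1 + df)) + 1 with n and df taken from the training set; the toy documents are my own:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, chosen so the arithmetic is easy to follow
train_docs = ["apple banana apple", "banana cherry"]
test_doc = "apple apple cherry"

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
vec.fit(train_docs)
sk_row = vec.transform([test_doc]).toarray()[0]

# Reproduce by hand. Vocabulary (alphabetical): apple, banana, cherry.
tf = np.array([2.0, 0.0, 1.0])        # raw counts in the *test* document
n = len(train_docs)                   # 2 training documents
df = np.array([1.0, 2.0, 1.0])        # *training-set* document frequencies
idf = np.log((1 + n) / (1 + df)) + 1  # smoothed IDF from training stats
manual = tf * idf
manual /= np.linalg.norm(manual)      # L2-normalise the row

print(np.allclose(sk_row, manual))    # True
```

Note that the term counts come from the test document being transformed, while n and df are frozen at fit time; nothing about the rest of the test set enters the calculation.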

You can test that pretty quickly by transforming a test dataset's first row separately from the entire test dataset: you ought to get the same transformed value for the first row, independent of what other rows are in the test set. You may also like to read through the toy example in the User Guide, or have a look at the source code for TfidfVectorizer.transform (note that super here is a CountVectorizer and self._tfidf is a TfidfTransformer) and TfidfTransformer.transform.
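The quick check described above might look like this (again with made-up documents):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpora, assumed for illustration
train_docs = ["red green blue", "green blue", "blue"]
test_docs = ["red blue", "green green red"]

vec = TfidfVectorizer().fit(train_docs)

whole = vec.transform(test_docs).toarray()
first_alone = vec.transform(test_docs[:1]).toarray()

# The first row is identical whether or not other test rows are present,
# because transform never updates the fitted document frequencies
print(np.allclose(whole[0], first_alone[0]))  # True
```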