Solved – How does a bag-of-words model treat words that were never seen before (not in the training data)

bag of words, classification, machine learning, natural language, text mining

What happens when a text classifier built on a bag-of-words model (say, with logistic regression) encounters a word the model has never seen before, i.e., a word that was not in the training data? How does it handle or treat these extra features?

The reason this confuses me is that the samples we predict on usually must have the same number of features as the training samples, yet most text classification implementations seem able to predict on samples containing more (or different) words than the training data.
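To see why the dimensionality stays fixed, here is a minimal sketch (plain Python, with hypothetical helper names) of a bag-of-words vectorizer: the vocabulary is frozen when it is fit on the training data, so every vector has the same length, and words outside the vocabulary are simply dropped.

```python
# Toy bag-of-words vectorizer: the vocabulary is fixed at fit time,
# so every vector has the same length regardless of the input text.
def fit_vocabulary(train_docs):
    vocab = {}
    for doc in train_docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(doc, vocab):
    counts = [0] * len(vocab)          # dimension = |training vocabulary|
    for word in doc.lower().split():
        idx = vocab.get(idx) if False else vocab.get(word)  # unseen words return None...
        if idx is not None:
            counts[idx] += 1           # ...and are silently ignored
    return counts

train = ["the cat sat", "the dog ran"]
vocab = fit_vocabulary(train)
# "zebra" was never seen: it contributes nothing to the vector.
vec = vectorize("the zebra sat", vocab)
print(len(vec), vec)  # → 5 [1, 0, 1, 0, 0]
```

Real implementations such as scikit-learn's `CountVectorizer` behave the same way: `transform` on new text keeps the vocabulary learned during `fit` and ignores everything else.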

Best Answer

Bag of words is a representation of the text; the classifier built on top of it decides how each word is used. For example, a decision tree will likely use only a few words and will be indifferent to any word not appearing in the training set.

Note that since word frequencies follow a long-tailed distribution with many rare words, the situation you describe is very common. While one could extend the standard bag-of-words representation with an "other" feature for new words, it would probably not be beneficial, since that feature would fire on most documents and therefore carry little information. Add to that the loss of each new word's exact meaning, and the technicalities of defining the train/test split so the classifier can learn to use such a feature, and it hardly seems worth the effort.
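For concreteness, an explicit "other" bucket could be sketched as below (plain Python, hypothetical names): every unseen word is counted in one shared out-of-vocabulary dimension instead of being dropped, which illustrates the drawback noted above: distinct new words all collapse into the same feature.

```python
# Variant with an explicit "other" bucket: every unseen word is counted
# in one shared OOV (out-of-vocabulary) dimension instead of dropped.
OOV = "<other>"

def fit_vocab_with_oov(train_docs):
    vocab = {OOV: 0}                   # reserve index 0 for unseen words
    for doc in train_docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize_with_oov(doc, vocab):
    counts = [0] * len(vocab)
    for word in doc.lower().split():
        counts[vocab.get(word, vocab[OOV])] += 1   # unseen -> OOV slot
    return counts

v = fit_vocab_with_oov(["the cat sat"])
# "zebra" and "jumped" both land in the single OOV dimension,
# so their distinct meanings are lost.
x = vectorize_with_oov("the zebra jumped", v)
print(x)  # → [2, 1, 0, 0]
```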

If you have a way to aggregate words (e.g., stemming or lemmatisation), then you can benefit from new words that are similar to words already seen.
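A toy suffix-stripping "stemmer" (a crude stand-in for a real one such as the Porter stemmer; the function names are made up) shows how this aggregation lets a surface form never seen in training still activate a known feature:

```python
# Crude suffix-stripping "stemmer": mapping inflected forms to a shared
# stem lets a word form absent from training still hit a known feature.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Features are indexed by stem, built from the training words.
train_vocab = {toy_stem(w): i for i, w in enumerate(["jump", "run"])}

def stem_features(doc, vocab):
    counts = [0] * len(vocab)
    for word in doc.lower().split():
        idx = vocab.get(toy_stem(word))
        if idx is not None:
            counts[idx] += 1
    return counts

# "jumps" never appeared in training, but it stems to "jump",
# so it maps onto an existing feature instead of being ignored.
feats = stem_features("she jumps", train_vocab)
print(feats)  # → [1, 0]
```

The same idea extends to lemmatisation or to character n-grams, both of which generalise across word forms rather than treating each surface string as its own feature.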
