Natural Language – Difference Between Corpus and Vocabulary in NLP

In NLP, what is the difference between corpus and vocabulary?

I often see these two terms used, and they seem to refer to the same thing. Is there a difference between them, or do they mean the same thing?

Best Answer

They are not the same thing.

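  • Corpus: The collection of texts (documents, articles, sentences) that an NLP model is trained or evaluated on.
  • Vocabulary: The set of unique tokens (words or subword units) that the model recognizes, usually built from the corpus.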

It might be easier to explain by example: BERT is an NLP model that was pretrained on English Wikipedia together with the BooksCorpus dataset. The corpus is that collection of articles and books. The vocabulary is not the whole English language but a fixed list of roughly 30,000 WordPiece tokens (whole words and word fragments); a word that is not in the list gets split into smaller pieces that are.
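
To make the distinction concrete, here is a minimal sketch in Python. The two sentences and the `build_vocabulary` helper are made up for illustration, and splitting on whitespace is a simplifying assumption; real models such as BERT use a subword tokenizer instead.

```python
from collections import Counter

# Toy corpus: a collection of texts (hypothetical example sentences).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]


def build_vocabulary(texts):
    """Count how often each unique token occurs across all texts."""
    counts = Counter()
    for text in texts:
        # Simplifying assumption: lowercase and split on whitespace.
        counts.update(text.lower().split())
    return counts


token_counts = build_vocabulary(corpus)

# The vocabulary is the set of unique tokens found in the corpus.
vocabulary = sorted(token_counts)

print("documents in corpus:", len(corpus))
print("total tokens in corpus:", sum(token_counts.values()))
print("vocabulary size:", len(vocabulary))
print("vocabulary:", vocabulary)
```

The corpus here holds 12 tokens across 2 texts, while the vocabulary holds only the 7 distinct tokens. Adding more text grows the corpus every time, but grows the vocabulary only when a new token appears.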
