ELMo Model – Difference Between Token Embedding and Character Embedding

natural language · neural networks · recurrent neural network · word embeddings

I am learning about a famous NLP model called ELMo. In the explanations, they talk about two types of embeddings: 1) character representations and 2) token representations.

Why is there a need to consider two types of representations? What are the differences between them?
How do they affect the training?

Best Answer

Short version: the character representations are there so you can still embed tokens that were never seen during training.


Recall that embedding an (atomic) object is done by selecting the corresponding vector in a lookup table. ELMo has a token embedding for "cat", for "the", for "likewise", and so on. All of those exist in a table. But what about a rare word like "snuffleupagus"? It doesn't exist in the table, because the word wasn't seen during training.
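Here's a toy sketch of that lookup (not ELMo's actual code; the vocabulary and vectors are made up for illustration):

```python
import numpy as np

# Toy lookup table: every known token has its own row of parameters.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "likewise"]
table = {tok: rng.standard_normal(4) for tok in vocab}

print(table["cat"])  # works: "cat" was in the training vocabulary

try:
    print(table["snuffleupagus"])
except KeyError:
    print("no row for 'snuffleupagus' -- the token was never seen in training")
```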

The default NLP strategy for unseen tokens, going back 50+ years, is to have a special "out-of-vocabulary" (OOV) representation: if we don't find the word in our lookup table, we just use the vector for OOV.
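A minimal sketch of that fallback (the `<OOV>` name is just a conventional placeholder, and the vectors are again made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "likewise", "<OOV>"]  # one reserved slot for unknown tokens
table = {tok: rng.standard_normal(4) for tok in vocab}

def embed(token):
    # Any token not in the table falls back to the single shared <OOV> vector.
    return table.get(token, table["<OOV>"])

# Every unseen word collapses onto the exact same representation.
for w in ["snuffleupagus", "dextromethorphan", "arrogate"]:
    print(w, embed(w))
```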

The problem (that ELMo tries to solve with character embeddings) is that not all unseen words are created equal. Should "snuffleupagus", "dextromethorphan", and "arrogate" all be treated identically?


Words aren't atomic. They're made up of parts, and ELMo uses characters as the parts. By instead building a representation of a word from its characters, the model can distinguish between these three words. It can also pick up on regular patterns: the -us ending typically signals a noun, for instance, and dextro- typically shows up in medicinal contexts.
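As a toy illustration (the character inventory and helper name are invented for this example), each word can be fed in as a sequence of character IDs rather than a single token ID, so an unseen word still decomposes into familiar parts and shared endings become visible:

```python
# Map each word to the sequence of its character IDs instead of one token ID.
chars = sorted(set("abcdefghijklmnopqrstuvwxyz-"))
char_to_id = {c: i for i, c in enumerate(chars)}

def char_ids(word):
    return [char_to_id[c] for c in word.lower()]

# An unseen word still yields a meaningful input...
print(char_ids("snuffleupagus"))
# ...and words with the same ending share their final character IDs,
# so a pattern like the "-us" suffix is visible to the encoder.
print(char_ids("snuffleupagus")[-2:], char_ids("cactus")[-2:])
```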


Each character in a token gets its own embedding; ELMo combines these character embeddings with a character-level CNN (followed by highway layers), which gives a representation of the whole word token, and a bi-LSTM then runs over those token representations to produce the contextual embeddings. There's thus information at two levels: the character level and the word level. Still, there are no fixed embeddings for words. fastText is a different model that uses a combination of word embeddings and subword embeddings; maybe that'd be a fun next thing to read about.
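Below is a simplified PyTorch sketch of such a character-level encoder, just to show the shape of the computation. The class name, dimensions, and layer widths are invented, and the real ELMo character encoder uses multiple filter widths plus highway layers rather than this single convolution:

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Turn a sequence of character IDs into a single word vector."""
    def __init__(self, n_chars=262, char_dim=16, n_filters=32, kernel=3, word_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Convolve over character positions, then max-pool across the word.
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=kernel, padding=1)
        self.proj = nn.Linear(n_filters, word_dim)

    def forward(self, char_ids):             # char_ids: (batch, word_len)
        x = self.char_emb(char_ids)          # (batch, word_len, char_dim)
        x = x.transpose(1, 2)                # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))         # (batch, n_filters, word_len)
        x = x.max(dim=2).values              # max-pool over character positions
        return self.proj(x)                  # (batch, word_dim)

enc = CharWordEncoder()
word = torch.randint(0, 262, (1, 13))        # e.g. "snuffleupagus" as 13 char IDs
print(enc(word).shape)                       # torch.Size([1, 64])
```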