Solved – Advantage of character-based language models over word-based

language-models, machine-learning, natural-language, neural-networks, recurrent-neural-network

Is there an intuition for why character-based language models are preferred over word-based ones? For example, Karpathy builds his language model by predicting the next character in his blog post (Karpathy Blog).

The aspect I am struggling with is that not every combination of characters is a word, so intuitively I would try to predict the next word (or a word embedding, and compute the squared error). I think this is also the approach used in the Skip-Thought sentence embeddings proposed by Kiros et al.

So my question is: what are the advantages and disadvantages of character-based language models compared to word-based models?

Best Answer

The main advantage of working with character-level generative models is that the discrete space you're working with is much smaller -- there are about 97 English-language characters in common usage if we include all punctuation marks. By contrast, a word vocabulary contains many thousands of entries. This means that just storing the word embeddings requires a lot of memory, and including word embeddings in a model adds many, many parameters, so the computational cost is much higher than for a character-level model.
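A rough back-of-the-envelope comparison makes the size gap concrete. This sketch uses an assumed embedding dimension of 300 and an assumed word vocabulary of 50,000 (neither figure is from the answer, which only gives the ~97-character count):

```python
# Illustrative parameter counts for the embedding table alone.
char_vocab = 97        # roughly the common English characters incl. punctuation
word_vocab = 50_000    # an assumed, typical word vocabulary size
embed_dim = 300        # assumed embedding dimension

char_params = char_vocab * embed_dim   # size of a character embedding table
word_params = word_vocab * embed_dim   # size of a word embedding table

print(f"char embedding table: {char_params:,} parameters")
print(f"word embedding table: {word_params:,} parameters")
print(f"ratio: {word_params // char_params}x")
```

Even before counting the recurrent layers, the word-level embedding table is hundreds of times larger than the character-level one.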

Misspelled words, and other words not appearing in the vocabulary, are usually mapped to a special "unknown" token. This suffices for some practical contexts, but it also means the model is not terribly flexible when it comes to generating new text.
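To illustrate the "unknown" token behavior, here is a minimal sketch with a hypothetical four-entry vocabulary; the key point is that a typo and a genuinely novel word become indistinguishable to the model:

```python
# Hypothetical tiny word vocabulary; id 0 is reserved for unknown words.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def tokenize(sentence):
    # Any word missing from the vocabulary collapses to the "<unk>" id,
    # so "teh" (a typo) and "miaowed" (out-of-vocabulary) look identical.
    return [vocab.get(w, vocab["<unk>"]) for w in sentence.lower().split()]

print(tokenize("The cat sat"))       # [1, 2, 3]
print(tokenize("Teh cat miaowed"))   # [0, 2, 0]
```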

On the other hand, character-level models can spontaneously generate unusual words with some (small) probability. As you point out, this can also mean that nonsense words or obvious typos can creep into the generated text, but in my experience, a well-trained generative model usually does pretty well at learning how to spell.
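By contrast, a character vocabulary can represent any string built from known characters, so novel or rare words are always expressible. A small sketch (the word "flurbish" is a made-up example):

```python
import string

# Build a character vocabulary from the printable ASCII set.
chars = sorted(set(string.printable))
char_to_id = {c: i for i, c in enumerate(chars)}

def encode(text):
    # Every character maps to an id, so no "unknown" token is needed
    # as long as the text sticks to known characters.
    return [char_to_id[c] for c in text]

novel_word = "flurbish"  # never seen in any training corpus, yet encodable
ids = encode(novel_word)
decoded = "".join(chars[i] for i in ids)
print(decoded)  # flurbish
```

This is exactly why a character-level model can, with some small probability, emit words that appear in no training corpus.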

One of the key papers in character-level RNNs is "Generating Sequences With Recurrent Neural Networks" by Alex Graves.