Solved – How to make a seq2seq model work with infinite vocabulary

deep learning · natural language · neural networks · recurrent neural network · seq2seq

I have trained a translation seq2seq model. In my model, I limited the vocabulary size to 100,000 words. This constraint prevents my model from generating any word outside those 100,000.

So how do Google Translate or Bing Translate work for any word in their input?

Basically, my question is how to make my model work with an infinite vocabulary.

Best Answer

Basically, my question is how to make my model work with an infinite vocabulary.

That would be unwise (how would you do optimization over such data?).

But you don't have to. What you're really asking is how to deal with unknown (out-of-vocabulary) words.

One answer is to use a different representation for words: instead of representing them as one-hot vectors over a fixed vocabulary, you can use subword features (such as characters or character n-grams). You can find papers using this terminology; these are also called character-level features.
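For a concrete picture, here is a minimal sketch of character n-gram features in the spirit of fastText-style subword embeddings. The boundary markers, n-gram range, and hash-bucket size are illustrative choices, not something prescribed above.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"  # '<' and '>' mark the word boundaries
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

def ngram_feature_ids(word, num_buckets=2**18):
    """Hash each n-gram into a fixed-size bucket space, so any word,
    seen or unseen, maps to a bounded set of feature ids."""
    return sorted({hash(g) % num_buckets for g in char_ngrams(word)})

# Even a word absent from the training vocabulary gets a representation:
print(char_ngrams("unhappiness")[:5])      # ['<un', 'unh', 'nha', 'hap', 'app']
print(len(ngram_feature_ids("unhappiness")))
```

The point is that the model's parameters are tied to a bounded set of subword units rather than to an unbounded set of whole words, so unseen words still decompose into known pieces.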

For intuition, consider some linguistic knowledge: most words are not completely unrelated to other words; they are formed from more basic parts, or morphemes.
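As a toy illustration of that intuition, the sketch below greedily segments a word into known subword units. The hand-written subword list and the greedy longest-match rule are stand-ins for what real systems learn from data (e.g., byte-pair encoding), not an actual segmentation algorithm from the answer.

```python
# Illustrative, hand-written subword vocabulary (real systems learn this).
SUBWORDS = {"un", "happi", "happy", "ness", "es", "s", "ing", "talk"}

def segment(word, vocab=SUBWORDS):
    """Greedily split a word into the longest known subwords,
    falling back to single characters for anything unknown."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: emit as-is
            i += 1
    return pieces

print(segment("unhappiness"))  # ['un', 'happi', 'ness']
print(segment("talking"))      # ['talk', 'ing']
```

A word the model has never seen, like "unhappiness", is still covered by morpheme-like pieces it has seen many times, which is exactly why subword models sidestep the fixed-vocabulary limit.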