Solved – difference between keras embedding layer and word2vec

keras, natural language, neural networks, word embeddings, word2vec

In other words, is there a paper that describes the method behind the Keras embedding layer? Is there a comparison between these methods (and other methods like GloVe, etc.)?

Best Answer

Embeddings (in general, not only in Keras) are methods for learning vector representations of categorical data. They are most commonly used for working with textual data. Word2vec and GloVe are two popular frameworks for learning word embeddings. What embeddings do is simply learn to map one-hot encoded categorical variables to vectors of floating-point numbers of smaller dimensionality than the input vectors. For example, a one-hot vector representing a word from a vocabulary of size 50,000 is mapped to a real-valued vector of size 100. The embedding vector is then used as features for whatever task you want.

one-hot vector $\to$ real-valued vector $\to$ (additional layers of the network)
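As a minimal sketch of such a mapping, here is the Keras `Embedding` layer in isolation (assuming TensorFlow 2.x; the vocabulary size and dimensionality follow the 50,000 → 100 example above, and the word indices are made up):

```python
import numpy as np
import tensorflow as tf

vocab_size = 50_000   # size of the vocabulary
embed_dim = 100       # dimensionality of the learned vectors

# Maps integer word indices to dense, trainable 100-dimensional vectors.
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

# A batch of two "sentences", each encoded as 4 integer word indices.
word_ids = np.array([[12, 7, 431, 2],
                     [95, 3, 18, 0]])

vectors = embedding(word_ids)
print(vectors.shape)  # (2, 4, 100): one 100-dimensional vector per word
```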

The difference is in how Word2vec is trained, as compared to the "usual" learned embedding layers. Word2vec is trained to predict whether a word belongs to the context, given the other words, e.g. to tell whether "milk" is a likely word given the sentence beginning "The cat was drinking...". By doing so, we expect Word2vec to learn something about the language, as in the quote "You shall know a word by the company it keeps" by John Rupert Firth. Using the above example, Word2vec learns that "cat" is something that is likely to appear together with "milk", but also with "house" or "pet", so it is somehow similar to "dog". As a consequence, embeddings created by Word2vec, or similar models, learn to represent words with similar meanings using similar vectors.
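A rough sketch of this kind of training, using the gensim implementation of Word2vec (assuming gensim ≥ 4.0, where the dimensionality argument is `vector_size`; the toy corpus is purely illustrative):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "was", "drinking", "milk"],
    ["the", "dog", "was", "drinking", "water"],
    ["people", "love", "their", "pet", "cat"],
    ["people", "love", "their", "pet", "dog"],
]

# sg=1 selects the skip-gram objective: predict context words from the target word.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv["cat"].shape)              # (100,) — the learned vector for "cat"
print(model.wv.similarity("cat", "dog"))  # cosine similarity between the two vectors
```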

On the other hand, with embeddings learned as a layer of a neural network, the network may be trained to predict whatever you want. For example, you can train your network to predict the sentiment of a text. In such a case, the embeddings learn features that are relevant for this particular problem. As a side effect, they can also learn some general things about the language, but the network is not optimized for such a task. Using the "cat" example, embeddings trained for sentiment analysis may learn that "cat" and "dog" are similar, because people often say nice things about their pets.
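A minimal sketch of this setup in Keras (assuming TensorFlow 2.x; the layer sizes and the sentiment-classification wiring are illustrative choices, not something prescribed above):

```python
import tensorflow as tf

vocab_size = 50_000
embed_dim = 100

# The Embedding layer is trained end-to-end with the rest of the network,
# so its vectors are optimized for the sentiment task, not for language in general.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    tf.keras.layers.GlobalAveragePooling1D(),        # average the word vectors
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative sentiment
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_word_ids, sentiment_labels, epochs=5)  # with your own data
```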

In practical terms, you can use pretrained Word2vec embeddings as features in any neural network (or other algorithm). They can give you an advantage if your data is small, since the pretrained embeddings were trained on large volumes of text. On the other hand, there are examples showing that learning the embeddings from your own data, optimized for a particular problem, may be more effective (Qi et al., 2018).
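One common way to do this is to copy the pretrained vectors into a frozen Keras `Embedding` layer. The sketch below assumes TensorFlow 2.x and gensim ≥ 4.0, and trains a tiny Word2vec model inline only as a stand-in for real pretrained vectors:

```python
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec

# Stand-in for pretrained vectors; in practice you would load a large model instead.
sentences = [["the", "cat", "drinks", "milk"], ["the", "dog", "drinks", "water"]]
wv = Word2Vec(sentences, vector_size=100, min_count=1).wv

# Word-to-index mapping for your own data; index 0 is reserved for padding.
word_index = {word: i + 1 for i, word in enumerate(wv.index_to_key)}

embedding_matrix = np.zeros((len(word_index) + 1, wv.vector_size))
for word, idx in word_index.items():
    embedding_matrix[idx] = wv[word]     # copy the pretrained vector for each word

pretrained_embedding = tf.keras.layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=wv.vector_size,
    trainable=False,                      # keep the pretrained vectors frozen
)
pretrained_embedding.build((None,))                   # create the weight matrix
pretrained_embedding.set_weights([embedding_matrix])  # load the Word2vec vectors
```

Setting `trainable=True` instead would let the network fine-tune the pretrained vectors for your particular task.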

Qi, Y., Sachan, D. S., Felix, M., Padmanabhan, S. J., & Neubig, G. (2018). When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation? arXiv preprint arXiv:1804.06323.
