Solved – How is the Embedding layer trained in Keras?

deep-learning · keras · word-embeddings

How is the Embedding layer in Keras trained?
(say, using the TensorFlow backend: is it similar to word2vec, GloVe, or fastText?)

Assume we do not use a pretrained embedding.

Best Answer

Embedding layers in Keras are trained just like any other layer in your network architecture: they are tuned to minimize the loss function by the selected optimization method. The major difference from other layers is that their output is not a mathematical function of the input. Instead, the input to the layer is used to index a table with the embedding vectors [1]. However, the underlying automatic differentiation engine has no problem optimizing these vectors to minimize the loss function.
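
To make the lookup concrete, here is a minimal sketch (the vocabulary size, dimension, and token ids are arbitrary illustration values): the layer's weight matrix is an ordinary trainable variable, and integer token ids simply select rows from it.

```python
import numpy as np
import tensorflow as tf

# The Embedding layer is just a trainable lookup table.
# input_dim = vocabulary size, output_dim = embedding dimension
# (both are arbitrary values chosen for this example).
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)

# Integer token ids index rows of the weight matrix; no one-hot
# matrix multiplication is actually performed.
token_ids = np.array([[3, 7, 42]])
vectors = embedding(token_ids)  # shape (1, 3, 8)

# The table is an ordinary trainable weight, updated through
# backpropagation like any other layer's weights.
print(embedding.trainable_weights[0].shape)  # (1000, 8)
```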

So you cannot say that the Embedding layer in Keras does the same as word2vec [2]. Remember that word2vec refers to a very specific network setup that tries to learn an embedding capturing the semantics of words. With Keras's Embedding layer, you are just trying to minimize the loss function; if, for instance, you are working on a sentiment classification problem, the learned embedding will probably not capture complete word semantics, but just their emotional polarity.
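
For instance, a sentiment classifier along these lines (a minimal sketch; the vocabulary size, dimension, and pooling choice are assumptions, not anything prescribed by the answer) would learn embeddings shaped entirely by the polarity labels:

```python
import tensorflow as tf

# Illustrative values: 10k-word vocabulary, 32-dimensional embedding.
vocab_size, embed_dim = 10_000, 32

# The Embedding layer is trained from scratch as part of the classifier,
# so its vectors end up encoding whatever helps predict sentiment.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)  # x_train: padded integer sequences
```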

For example, the following image, taken from [3], shows the embeddings of three sentences produced by a Keras Embedding layer trained from scratch as part of a supervised network designed to detect clickbait headlines (left), and by pre-trained word2vec embeddings (right). As you can see, the word2vec embeddings reflect the semantic similarity between phrases b) and c). Conversely, the embeddings generated by Keras's Embedding layer might be useful for classification, but do not capture the semantic similarity of b) and c).

[Image from [3]: embeddings of three sentences; Keras Embedding layer trained from scratch (left) vs. pre-trained word2vec (right)]

This explains why, when you have a limited amount of training samples, it can be a good idea to initialize your Embedding layer with word2vec weights, so that your model at least recognizes that "Alps" and "Himalaya" are similar, even if they do not both occur in sentences of your training dataset.
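
One common way to do this is to build a weight matrix from the pre-trained vectors and pass it as the layer's initializer. A minimal sketch, assuming the word2vec vectors have already been loaded into a dict (e.g. via gensim); the tiny `pretrained` and `word_index` dicts below are placeholder data for illustration only:

```python
import numpy as np
import tensorflow as tf

embed_dim = 4  # real word2vec vectors would be e.g. 300-dimensional

# Placeholder stand-ins for loaded word2vec vectors and the tokenizer's
# word -> integer-id mapping (both are assumptions for this sketch).
pretrained = {"alps": np.ones(embed_dim, "float32"),
              "himalaya": np.ones(embed_dim, "float32")}
word_index = {"alps": 1, "himalaya": 2, "clickbait": 3}

# Known words get their pre-trained vector; unknown words keep a
# small random initialization (index 0 is reserved for padding).
matrix = np.random.normal(scale=0.1,
                          size=(len(word_index) + 1, embed_dim)).astype("float32")
for word, i in word_index.items():
    if word in pretrained:
        matrix[i] = pretrained[word]

embedding = tf.keras.layers.Embedding(
    input_dim=matrix.shape[0],
    output_dim=embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(matrix),
    trainable=True,  # set False to freeze the pre-trained vectors
)
```

Whether to keep the layer trainable is a judgment call: freezing preserves the word2vec semantics exactly, while fine-tuning lets the vectors adapt to the task at the risk of overfitting on small datasets.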

[1] How does Keras 'Embedding' layer work?

[2] https://www.tensorflow.org/tutorials/word2vec

[3] https://link.springer.com/article/10.1007/s10489-017-1109-7

NOTE: strictly speaking, the image shows the activations of the layer after the Embedding layer, but for the purposes of this example it does not matter; see [3] for more details.