Solved – Is the Keras Embedding layer dependent on the target label

embeddings, keras, neural-networks, word-embeddings

I have learned how to use the Keras Embedding layer, but I cannot find any more specific information about the actual behavior and training process of this layer. So far, I understand that the Keras Embedding layer maps distinct categorical features to n-dimensional vectors, which allows us to find, for example, how similar two features are.

What I do not understand is how the vectors in the embedding layer are trained. I have seen an explanation saying that these vectors are not computed by any operation but simply act as a lookup table, yet I always thought they were somehow "trained" to capture similarities between distinct features.
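To make the lookup-table idea concrete, here is how I currently picture the forward pass (plain NumPy; the vocabulary size of 50 and vector size of 4 are arbitrary choices of mine):

import numpy as np

# my mental model: the layer holds a trainable matrix W, and the forward
# pass just picks one row of it per integer feature, with no arithmetic
W = np.random.randn(50, 4)   # vocabulary of 50 ids, 4-dimensional vectors
X_row = [3, 5, 8, 45, 2]
vectors = W[X_row]           # shape (5, 4): pure row lookup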

If they are trained, are they trained from the target labels, from the order in which the features appear (as in GloVe, word2vec, etc.), or from both?

I have the following example of two pairs of rows from a dataset, where y is the model's target label and X are the features, encoded as integers for use in the embedding layer:

#pair 1    
dataset_y_row1 = [1]
dataset_y_row2 = [0]
dataset_X_row1 = [3,5,8,45,2]
dataset_X_row2 = [3,5,8,45,2]

#pair 2
dataset_y_row3 = [1]
dataset_y_row4 = [1]
dataset_X_row3 = [3,5,8,45,2]
dataset_X_row4 = [3,5,45,8,2]

My questions are the following:

  1. Will the embedding layer see any difference between rows 1 and 2 (i.e. is it 'target-label-sensitive')?
  2. Will the embedding layer see any difference between rows 3 and 4 (i.e. is it sensitive to the order of features, like word2vec, GloVe, etc.)?

Best Answer

An embedding layer for a vocabulary of size $m$ that encodes each word into an embedding vector of size $k$ is shorthand for one-hot encoding the words into $m$ features and then putting a dense layer with $k$ units (without bias or activation) on top of them. Word2vec and GloVe are specialized algorithms for learning the embeddings, but the end product is the same: a matrix of weights that is multiplied by the one-hot encoded words.
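A minimal sketch of this equivalence (using tf.keras; the sizes $m = 50$ and $k = 4$ are arbitrary):

import numpy as np
import tensorflow as tf

# embedding lookup vs. one-hot encoding followed by a matrix multiply:
# both pull the same rows out of the same (m, k) weight matrix
m, k = 50, 4
emb = tf.keras.layers.Embedding(input_dim=m, output_dim=k)
ids = np.array([3, 5, 8, 45, 2])

via_lookup = emb(ids).numpy()                # (5, k): rows of the table
one_hot = np.eye(m)[ids]                     # (5, m): one-hot encoded ids
via_matmul = one_hot @ emb.get_weights()[0]  # (5, m) @ (m, k) -> (5, k)
print(np.allclose(via_lookup, via_matmul))   # True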

If you are interested in a detailed yet accessible introduction to word embeddings, check the series of blog posts by Sebastian Ruder.

To answer your question, one needs to consider your network architecture and your data. Algorithms like word2vec and GloVe are trained on language data to predict things like the next word in a sequence, so the order of the words matters to them. On the other hand, if you use an embedding layer that is trained from scratch as part of a larger network with some utilitarian purpose (e.g. spam detection, sentiment classification), then it works like any other dense layer and serves the purpose of automatic feature engineering. In the latter case, you would expect more specialized embeddings that learn features related to the objective of your network: the gradients of the loss on the target labels flow back into the embedding matrix, so the learned vectors are label-dependent (question 1), while the lookup itself is order-agnostic and any sensitivity to feature order comes from the layers above it (question 2).
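As a rough sketch of that label dependence (tf.keras; the toy model and sizes are made up, and the data is pair 1 from your example), the embedding matrix changes only because gradients of the supervised loss on y flow back into it:

import numpy as np
import tensorflow as tf

# toy classifier with embeddings trained from scratch: the only training
# signal that reaches the embedding matrix is the gradient of the loss on y
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=50, output_dim=4),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

X = np.array([[3, 5, 8, 45, 2], [3, 5, 8, 45, 2]])
y = np.array([1, 0])                       # pair 1 from the question

_ = model(X)                               # build the model so the weights exist
before = model.layers[0].get_weights()[0].copy()
model.fit(X, y, epochs=5, verbose=0)
after = model.layers[0].get_weights()[0]
print(np.abs(after - before).max() > 0.0)  # True: label gradients updated the table

Note that only the rows whose ids actually occur in the training data receive gradient updates; unused rows keep their random initialization.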