Solved – How does the Keras 'Embedding' layer work?

keras, text mining, word embeddings

I need to understand how the 'Embedding' layer in the Keras library works. I run the following code in Python:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
# Vocabulary of 5 possible integers, each mapped to a 2-dimensional vector.
model.add(Embedding(5, 2, input_length=5))

# A single sequence of 5 random integers in [0, 5).
input_array = np.random.randint(5, size=(1, 5))

model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)

which gives the following output

input_array = [[4 1 3 3 3]]
output_array = 
[[[ 0.03126476  0.00527241]
  [-0.02369716 -0.02856163]
  [ 0.0055749   0.01492429]
  [ 0.0055749   0.01492429]
  [ 0.0055749   0.01492429]]]

I understand that each value in input_array is mapped to a 2-element vector in output_array, so a 1 × 5 input gives a 1 × 5 × 2 output. But how are the mapped values computed?

Best Answer

In fact, the output vectors are not computed from the input using any mathematical operation. Instead, each input integer is used as an index into a table that contains all possible vectors. That is why you need to specify the vocabulary size as the first argument (so the table can be initialized).
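As a minimal illustration of this lookup idea, here is a sketch in plain NumPy that uses a hypothetical random table in place of the layer's weights:

import numpy as np

# Hypothetical lookup table: one 2-dimensional vector per possible input integer.
# In Keras, this table is the embedding layer's weight matrix (randomly initialized).
table = np.random.rand(5, 2)

input_array = np.array([[4, 1, 3, 3, 3]])
output_array = table[input_array]   # plain integer indexing, shape (1, 5, 2)

# Repeated indices (here the three 3s) get identical vectors, as in the question.
print(output_array)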

The most common application of this layer is for text processing. Let's see a simple example. Our training set consists only of two phrases:

Hope to see you soon

Nice to see you again

So we can encode these phrases by assigning each word a unique integer (for example, by order of appearance in our training set). Our phrases can then be rewritten as:

[0, 1, 2, 3, 4]

[5, 1, 2, 3, 6]
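One simple way to build this encoding is with a plain Python dictionary filled in order of first appearance (the variable names here are just for illustration):

phrases = ["Hope to see you soon", "Nice to see you again"]

# Assign each word a unique integer in order of first appearance.
word_index = {}
for phrase in phrases:
    for word in phrase.split():
        if word not in word_index:
            word_index[word] = len(word_index)

encoded = [[word_index[word] for word in phrase.split()] for phrase in phrases]
print(encoded)   # [[0, 1, 2, 3, 4], [5, 1, 2, 3, 6]]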

Now imagine we want to train a network whose first layer is an embedding layer. In this case, we should initialize it as follows:

Embedding(7, 2, input_length=5)

The first argument (7) is the number of distinct words in the training set. The second argument (2) indicates the size of the embedding vectors. The input_length argument, of course, determines the length of each input sequence.
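For concreteness, a minimal model built around such a layer could look like the following sketch; the Flatten and Dense layers after the embedding are just one arbitrary choice (e.g., predicting a binary label per phrase):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(7, 2, input_length=5))   # 7 distinct words, 2-dimensional embeddings
model.add(Flatten())                         # (5, 2) -> (10,) so a Dense layer can follow
model.add(Dense(1, activation='sigmoid'))    # hypothetical binary output per phrase
model.compile(optimizer='rmsprop', loss='binary_crossentropy')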

Once the network has been trained, we can get the weights of the embedding layer, which in this case will be of size (7, 2) and can be thought of as the table used to map integers to embedding vectors:

+------------+------------+
|   index    |  Embedding |
+------------+------------+
|     0      | [1.2, 3.1] |
|     1      | [0.1, 4.2] |
|     2      | [1.0, 3.1] |
|     3      | [0.3, 2.1] |
|     4      | [2.2, 1.4] |
|     5      | [0.7, 1.7] |
|     6      | [4.1, 2.0] |
+------------+------------+
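In code, and assuming a model like the sketch above has been fitted, this table is simply the embedding layer's weight matrix:

# The embedding layer has a single weight: the (vocabulary_size, embedding_dim) table.
embeddings = model.layers[0].get_weights()[0]
print(embeddings.shape)   # (7, 2)
print(embeddings[4])      # the learned 2-dimensional vector for "soon"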

So according to these embeddings, our second training phrase will be represented as:

[[0.7, 1.7], [0.1, 4.2], [1.0, 3.1], [0.3, 2.1], [4.1, 2.0]]

It might seem counterintuitive at first, but the underlying automatic differentiation engines (e.g., TensorFlow or Theano) optimize the vectors associated with each input integer just like any other parameter of your model.

For an intuition of how this table lookup can be expressed as a mathematical operation that the automatic differentiation engines can handle, consider the embeddings table from the example as a (7, 2) matrix. Then, for a given word, you create a one-hot vector based on its index and multiply it by the embeddings matrix, effectively replicating the lookup. For instance, for the word "soon" the index is 4, and the one-hot vector is [0, 0, 0, 0, 1, 0, 0]. If you multiply this (1, 7) matrix by the (7, 2) embeddings matrix you get the desired two-dimensional embedding, which in this case is [2.2, 1.4].
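The same computation in NumPy, using the made-up values from the table above:

import numpy as np

# Embeddings table from the example (7 words, 2-dimensional vectors).
E = np.array([[1.2, 3.1],
              [0.1, 4.2],
              [1.0, 3.1],
              [0.3, 2.1],
              [2.2, 1.4],
              [0.7, 1.7],
              [4.1, 2.0]])

# One-hot row vector for "soon" (index 4).
one_hot = np.zeros((1, 7))
one_hot[0, 4] = 1.0

print(one_hot @ E)   # [[2.2 1.4]], identical to E[4]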

It is also possible to reuse embeddings learned by other methods or by other people in different domains (see https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html), as done in [1].

[1] López-Sánchez, D., Herrero, J. R., Arrieta, A. G., & Corchado, J. M. Hybridizing metric learning and case-based reasoning for adaptable clickbait detection. Applied Intelligence, 1-16.