Solved – Keras neural network input shapes

deep learning, word embeddings

In this example from the Keras repository, the following convolutional neural network is trained on the IMDB dataset:

>>> model.summary()
------------------------------------------
Initial input shape: (None, 5000)  -------
------------------------------------------
Layer (name)                  Output Shape                  Param #             
------------------------------------------
Embedding (embedding)         (None, 100, 100)              500000              
Dropout (dropout)             (None, 100, 100)              0                   
Convolution1D (convolution1d) (None, 98, 250)               75250               
MaxPooling1D (maxpooling1d)   (None, 49, 250)               0                   
Flatten (flatten)             (None, 12250)                 0                   
Dense (dense)                 (None, 250)                   3062750             
Dropout (dropout)             (None, 250)                   0                   
Activation (activation)       (None, 250)                   0                   
Dense (dense)                 (None, 1)                     251                 
Activation (activation)       (None, 1)                     0                   
------------------------------------------
Total params: 3638251
------------------------------------------
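
For reference, a network with exactly these output shapes and parameter counts can be built as follows. This is a sketch using the modern tensorflow.keras API; the original example used an older Keras API, and the dropout rates and activations here are assumptions based on the standard imdb_cnn example:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Dropout, Conv1D,
                                     MaxPooling1D, Flatten, Dense, Activation)

model = Sequential([
    Embedding(5000, 100, input_length=100),  # 5000 * 100 = 500,000 params
    Dropout(0.25),                           # rate is an assumption
    Conv1D(250, 3, activation='relu'),       # 100 - 3 + 1 = 98 steps; 3*100*250 + 250 = 75,250 params
    MaxPooling1D(pool_size=2),               # 98 / 2 = 49 steps
    Flatten(),                               # 49 * 250 = 12,250 features
    Dense(250),                              # 12,250 * 250 + 250 = 3,062,750 params
    Dropout(0.25),
    Activation('relu'),
    Dense(1),                                # 250 + 1 = 251 params
    Activation('sigmoid'),
])
model.summary()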

The IMDB dataset contains reviews, each composed of different words. So, given a 5000-word vocabulary, the Initial input shape suggests that the first embedding layer should receive a vector of 5000 counters for each review, with counter i saying how many times the i-th word appears, or so I suppose.
However, when I print the shape of either the training or the test set, I get:

>>> X_train.shape
(20000, 100)
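
For context, this shape comes from how the example loads and pads the data. Here is a minimal sketch using the modern tensorflow.keras API (the original example used older function names, and the exact sample count depends on the version's train/test split):

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

vocab_size = 5000  # the 5000-word vocabulary from the summary above
maxlen = 100       # every review truncated or padded to 100 words

# Each review arrives as a list of integer word indices, not word counts
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)

print(X_train.shape)    # (n_reviews, 100)
print(X_train[0][:10])  # integer indices in [0, 5000), not counters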

Here 20000 is the number of reviews (which we can ignore) and 100 is the number of words per review. From there the embedding layer maps each review to

(None, 100, 100)

as seen in the Output Shape column. The first 100 is the number of words in each review, and the second 100 comes from mapping each word into a 100-dimensional space.
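
To see this mapping in isolation, here is a minimal sketch of an Embedding layer on its own (using tensorflow.keras; exact argument names can vary between Keras versions):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

model = Sequential()
# 5000 possible word indices, each mapped to a 100-dimensional vector
model.add(Embedding(input_dim=5000, output_dim=100))

batch = np.random.randint(0, 5000, size=(32, 100))  # (batch, words)
print(model.predict(batch).shape)                   # (32, 100, 100)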

I cannot understand the discrepancy between the Initial input shape and X.shape, where X is the training or test set. What is happening here?

Best Answer

As far as I can tell, your X datasets are 20K sequences of 100 "words" each, where each word is an integer index into the 5000-word vocabulary. The Embedding layer conceptually converts each index to a one-hot vector of length 5000 (that is the Initial input shape) and then maps it down to the 100-dimensional embedding space.
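
To make the one-hot picture concrete, here is a small numpy sketch of what the lookup is mathematically equivalent to. Note that the layer never materializes the one-hot vectors; it simply indexes into its weight matrix:

import numpy as np

vocab_size, embed_dim = 5000, 100
W = np.random.randn(vocab_size, embed_dim)  # the embedding weight matrix

idx = 42                        # one word index from a review
one_hot = np.zeros(vocab_size)  # the length-5000 "Initial input shape" view
one_hot[idx] = 1.0

# Row lookup and one-hot matrix product give the same 100-dim vector
assert np.allclose(W[idx], one_hot @ W)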