Solved – Understanding Word2Vec

bag of words, word embeddings, word2vec

I am trying to understand the word2vec algorithm (Mikolov et al.), but there are a few things I do not understand.

I get that the activation from the input layer to the hidden layer is linear, and that $\mathbf{h}$ is just the average of the projections of all input vectors $\mathbf{x}_{ik}$. Further, I understand that each component $y_i$ of the output vector $\mathbf{y}$ is obtained by applying the $\text{softmax}$ activation from the hidden to the output layer.
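To make sure I have the mechanics right, here is a minimal NumPy sketch of the forward pass as I understand it (all dimensions and context indices are made up):

```python
import numpy as np

# toy dimensions: vocabulary size V, embedding size N (both made up)
V, N = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input-to-hidden weights, W_{V x N}
W_out = rng.normal(size=(N, V))    # hidden-to-output weights, W'_{N x V}

context = [1, 3]                   # indices of the context words x_ik (hypothetical)

# hidden layer: plain average of the context words' input vectors (linear activation)
h = W[context].mean(axis=0)

# output layer: softmax over the vocabulary
scores = h @ W_out
y = np.exp(scores - scores.max())
y /= y.sum()
print(y)                           # probability assigned to every word in the vocabulary
```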

[figure: the word2vec network architecture, with the input vectors $\mathbf{x}_{ik}$, the hidden layer $\mathbf{h}$, and the softmax output $\mathbf{y}$]

What I do not understand is: what are my actual word vectors in the end? I currently see only one possibility, namely to take either the rows of $\mathbf{W_{V\times N}}$ or the columns of $\mathbf{W'_{N\times V}}$, since I assume these weights are the "information carrying" entity of this algorithm.

Am I correct with this assumption?

Independently of this, I would like to understand how Google managed to train their Google News word vector model.

From the website:

We are publishing pre-trained vectors trained on part of Google News dataset (about 100 billion words).

This seems simply insane if it has been done with one-hot encoded vectors for training. That would mean each input matrix $\mathbf{W_{V\times N}}$ would be $(1 \times 10^{11}) \times 300$ in size!?

This leads me to my final question: shouldn't it be possible to use lower-dimensional vectors? Somewhere in the back of my head I have the idea that you could simply randomly initialise word vectors for all words in a given vocabulary and then apply word2vec to them. This, however, would also mean that not only the weights get updated, but the word vector of each input word as well. Is something like this actually done, or am I completely mistaken here?
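To make the idea concrete, this is roughly what I picture (pure NumPy, all names and numbers made up): initialise a small dense vector per word and update it, together with the other weights, during training.

```python
import numpy as np

V, N = 5, 3                                     # made-up vocabulary and vector size
rng = np.random.default_rng(42)
vectors = rng.uniform(-0.5, 0.5, size=(V, N))   # randomly initialised word vectors

def sgd_update(word_idx, grad, lr=0.025):
    """Update only the vector of the word involved in the current training pair."""
    vectors[word_idx] -= lr * grad

# after computing some gradient for word 2 in a (context, target) pair:
sgd_update(2, np.full(N, 0.1))
print(vectors[2])
```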

Best Answer

what are my actual word vectors in the end?

The actual word vectors are the hidden representations $\mathbf{h}$. Basically, multiplying a one-hot vector with $\mathbf{W_{V\times N}}$ gives you a $1 \times N$ vector, which is the word vector of the word whose one-hot you fed in.

Here we multiply the $1 \times 5$ one-hot for, say, 'chicken' with synapse 1, $\mathbf{W_{V\times N}}$, to get its $1 \times 3$ vector representation.

Basically, $\mathbf{W_{V\times N}}$ captures the hidden representations in the form of a lookup table. To get the lookup value for a word, multiply its one-hot with $\mathbf{W_{V\times N}}$.

[figure: the $1 \times 5$ one-hot for 'chicken' multiplied by the $5 \times 3$ matrix $\mathbf{W_{V\times N}}$, yielding its $1 \times 3$ word vector]
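A minimal NumPy sketch of this lookup (toy numbers matching the figure; the index chosen for 'chicken' is made up):

```python
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # W_{V x N}, "synapse 1"

chicken = 2                        # hypothetical index of 'chicken' in the vocabulary
one_hot = np.zeros(V)
one_hot[chicken] = 1.0

h = one_hot @ W                    # the 1 x N word vector for 'chicken'
assert np.allclose(h, W[chicken])  # the multiplication is just a row lookup
print(h)
```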

That would mean each input matrix $\mathbf{W_{V\times N}}$ would be $(1 \times 10^{11}) \times 300$ in size!?

That matrix does get very large, but note that $V$ is the vocabulary size (the number of distinct words), not the 100 billion training tokens; the published Google News model covers about 3 million words and phrases, so $\mathbf{W_{V\times N}}$ is roughly $(3 \times 10^{6}) \times 300$. Training on that much text is still a huge job, so keep in mind two things:

  1. It is Google. They have a lot of computational resources.

  2. A lot of optimisations were used to speed up training, in particular hierarchical softmax, negative sampling, and subsampling of frequent words. You can go through the original code, which is publicly available (a high-level sketch of the same options follows below).
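For a sense of what those optimisations look like in practice, here is a hedged sketch using gensim (parameter names follow gensim 4.x; the toy sentences are made up):

```python
from gensim.models import Word2Vec

# `sentences` is assumed to be an iterable of tokenised sentences
sentences = [
    ["the", "chicken", "crossed", "the", "road"],
    ["orange", "juice", "and", "apple", "juice"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # N, the embedding dimension
    window=5,         # context window size
    min_count=1,      # keep every word in this toy corpus
    sg=1,             # skip-gram (0 = CBOW)
    negative=5,       # negative sampling instead of the full softmax
    sample=1e-3,      # subsampling of very frequent words
    workers=4,        # parallel training threads
)

print(model.wv["chicken"])  # the learned 300-dimensional vector for 'chicken'
```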

Shouldn't it be possible to use lower dimensional vectors?

I assume you mean using a vector like [1.2, 4.5, 4.3] to represent, say, 'chicken', feeding that into the network, and training on it. It seems like a good idea. I cannot justify the reasoning well enough, but I would like to point out the following:

  1. One-hots activate only one input neuron at a time, so the representation of a word comes down to the specific weights attached to just that word. [figure: the one-hot for 'juice' activating just 4 synaptic links per weight matrix ("synapse")]

  2. The loss function used is (essentially) cross-entropy loss, which pairs naturally with one-hot targets: it heavily penalises confident but incorrect classifications. In fact, most classification tasks use one-hots together with cross-entropy loss (a small numeric example follows below).

I know this isn't a fully satisfactory justification.
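To make point 2 concrete, though: with a one-hot target, the cross-entropy collapses to the negative log-probability the softmax assigns to the true word, so confident wrong predictions are punished hard. A tiny numeric sketch (values made up):

```python
import numpy as np

def cross_entropy(probs, one_hot_target):
    """Cross-entropy between a softmax output and a one-hot target."""
    return -np.sum(one_hot_target * np.log(probs))

probs = np.array([0.7, 0.1, 0.1, 0.05, 0.05])   # softmax output over a 5-word vocabulary
target = np.array([0.0, 1.0, 0.0, 0.0, 0.0])    # one-hot for the true word (index 1)

print(cross_entropy(probs, target))  # == -log(0.1): only the true word's term survives
print(-np.log(probs[1]))             # same value, roughly 2.30
```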

I hope this clears some things up.

Here are some resources:

  1. The famous article by Chris McCormick
  2. Interactive w2v model: wevi
  3. Understand w2v by understanding it in TensorFlow (my article; shameless advertisement, but it covers what I want to say)