Solved – word2vec neural network – bias units


I am trying to get my head around word2vec (paper) and the underlying Skip-gram model. I hope I have got the basics and the intuition right, but I am still not sure whether bias units are used in the input and/or the hidden layer.

The input is just a one-hot encoded vector, and it is often said that it merely serves as a selector for the weights associated with the corresponding word (there is no activation function), so I would say there is no bias unit added at the input layer. As for the hidden layer, the output neurons compute the following:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_w}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ are the "input" and "output" vector representations of $w$, and $W$ is the size of the vocabulary. Since there is no bias term in this expression, I don't think there is a bias unit in the hidden layer either.
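To check my reading, here is a minimal NumPy sketch of the forward pass as I understand it (the vocabulary size, embedding dimension and word index below are made-up values): the one-hot input simply selects one row of the input weight matrix, and the softmax scores are plain dot products, with no bias term anywhere.

```python
import numpy as np

V, d = 10, 5                       # made-up vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))     # "input" vectors v_w   (input -> hidden weights)
W_out = rng.normal(size=(V, d))    # "output" vectors v'_w (hidden -> output weights)

w_I = 3                            # index of the input (centre) word
x = np.zeros(V)
x[w_I] = 1.0                       # one-hot input vector

h = x @ W_in                       # just selects row w_I of W_in -- no bias, no activation
assert np.allclose(h, W_in[w_I])

scores = W_out @ h                 # v'_w^T v_{w_I} for every word w -- again no bias
p = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
```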

If I am right, why is there no need for bias units in this type of neural network? If I am wrong, can anyone explain how they fit into the description of the model?

Best Answer

It seems there are indeed no bias units at either layer. In his thesis on neural-network-based language models, Mikolov states:

[...] biases are not used in the neural network, as no significant improvement of performance was observed - following the Occam's razor, the solution is as simple as it needs to be.

(Mikolov, T.: Statistical Language Models Based on Neural Networks, p. 29)

While this is a quote concerning recurrent neural networks specifically, I am going to assume the same is valid for the Skip-gram model.
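For illustration, the same bias-free structure can be written down directly, e.g. in PyTorch. This is only a sketch with assumed hyperparameters (vocabulary size and embedding dimension are placeholders), not the reference word2vec implementation: the hidden layer is a plain embedding lookup, and the output layer is a linear map created with bias=False.

```python
import torch
import torch.nn as nn

V, d = 10_000, 300                 # assumed vocabulary size and embedding dimension

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed_in = nn.Embedding(vocab_size, dim)             # "input" vectors v_w
        self.embed_out = nn.Linear(dim, vocab_size, bias=False)   # "output" vectors v'_w, no bias

    def forward(self, centre_word_idx):
        h = self.embed_in(centre_word_idx)     # lookup == one-hot selection, no bias, no activation
        scores = self.embed_out(h)             # v'_w^T v_{w_I} for every word w, still no bias term
        return torch.log_softmax(scores, dim=-1)   # log-probabilities over the vocabulary

model = SkipGram(V, d)
log_probs = model(torch.tensor([42]))          # arbitrary centre word index
```

Note that in practice word2vec replaces the full softmax with hierarchical softmax or negative sampling for efficiency, but that does not change the absence of bias terms.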
