Solved – Learning image embeddings using VGG and Word2Vec

classification, embeddings, machine learning, word embeddings, word2vec

Background: In word2vec, we feed a one-hot encoding of our target word into a simple neural network that is trained to predict the context words in a window around that target. We eventually take the weights of the hidden layer to use as word embeddings – vector representations of our words.

When we feed a one-hot encoding into the hidden layer, we are essentially doing a look-up of the corresponding row of the weight matrix, so once training is finished we can say that this row is the embedding for that particular word.
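To make the look-up concrete, here is a minimal numpy sketch (toy sizes and random weights, just for illustration) showing that multiplying a one-hot vector by the weight matrix simply selects one row:

```python
import numpy as np

vocab_size, embed_dim = 5, 3
W = np.random.rand(vocab_size, embed_dim)   # hidden-layer weight matrix

one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0                            # one-hot vector for word index 2

# Multiplying by a one-hot vector just selects the matching row of W,
# so the embedding of word 2 is simply W[2].
assert np.allclose(one_hot @ W, W[2])
```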

My question is: I am passing images through a pre-trained VGG network and using the final intermediate layer as the encoding for each image. I am then using these encodings in place of one-hot vectors and feeding them into a skip-gram architecture like word2vec's to learn an image embedding.

In this case the dot product of my input feature vector with the hidden-layer weights is no longer a look-up that touches a single row – so how can I connect images with their embeddings in this setup?

Edit: Some added context – I am trying to learn style embeddings in a style vector space, following this paper: https://arxiv.org/pdf/1708.04014.pdf. The paper describes passing the images through a pre-trained VGG and then using these feature vectors to train the embedding layer. Normally we would then take the embedding-layer weights, but in this case there isn't a single row corresponding to our image because the input isn't one-hot. Do we take the embedding-layer activations instead of the actual weights?
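For reference, here is a minimal PyTorch sketch of what such a dense-input skip-gram model could look like; the layer names, sizes, and the number of context classes are hypothetical, not the paper's exact setup. The point is that the image's embedding is the activation of the hidden layer, not a row of its weight matrix:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 4096-dim VGG fc features, 128-dim style embedding,
# and some number of context items the skip-gram objective predicts.
feat_dim, embed_dim, n_context = 4096, 128, 10000

class DenseSkipGram(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim, bias=False)  # "hidden" embedding layer W
        self.out = nn.Linear(embed_dim, n_context)               # context prediction head

    def forward(self, vgg_feats):
        z = self.embed(vgg_feats)   # image embedding: the hidden-layer activation
        return self.out(z)          # logits over context items

model = DenseSkipGram()
v = torch.randn(1, feat_dim)        # stand-in for one image's VGG feature vector
embedding = model.embed(v)          # after training, this activation is the image embedding
```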

Best Answer

I think one-hot encoding is just a special representation of a word; it is not necessary. If you take an image representation $v$ from the last fully connected layer of the VGG network, and $W$ is the weight matrix of the embedding (hidden) layer, the product $v \cdot W$ gives you the embedding code.
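In code, this amounts to a single matrix product. A numpy sketch with stand-in values, assuming $v$ is a row vector and $W$ maps the feature dimension to the embedding dimension:

```python
import numpy as np

feat_dim, embed_dim = 4096, 128
W = np.random.rand(feat_dim, embed_dim)  # trained hidden-layer weights (stand-in values)
v = np.random.rand(feat_dim)             # VGG fc-layer feature vector for one image

embedding = v @ W                        # the image's embedding code
```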
