Solved – Learning image embeddings using VGG and Word2Vec

classification, embeddings, machine learning, word embeddings, word2vec

Background: In word2vec, we feed a one-hot encoding of our target word into a simple neural network that is trained to predict the context words in a window around that target. We eventually take the weights of the hidden layer to use as word embeddings – vector representations of our words.

When we feed a one-hot encoding into the hidden layer, we are essentially doing a look-up of the corresponding row of the weight matrix, so once training is finished we can say that this row is the embedding for that particular word.
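To make the look-up concrete, here is a minimal numpy sketch (toy sizes and random weights, just for illustration) showing that multiplying a one-hot vector by the weight matrix simply selects one row:

```python
import numpy as np

vocab_size, embed_dim = 5, 3
W = np.random.rand(vocab_size, embed_dim)   # hidden-layer weight matrix

one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0                            # one-hot vector for word index 2

# Multiplying by a one-hot vector just selects the matching row of W,
# so the embedding of word 2 is simply W[2].
assert np.allclose(one_hot @ W, W[2])
```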

My question is: I am passing images through a pre-trained VGG network and using the final intermediate layer as the encoding for each image. I am then using these encodings in place of one-hot vectors and feeding them into a skip-gram architecture like word2vec's to learn an image embedding.

In this case the dot product of my input feature vector with the hidden-layer weights is no longer a look-up that touches a single row – so how can I connect images with their embeddings in this setup?

Edit: Some added context – I am trying to learn style embeddings in a style vector space, following this paper: https://arxiv.org/pdf/1708.04014.pdf. The paper describes passing the images through a pre-trained VGG and then using these feature vectors to train the embedding layer. Normally we would then take the embedding-layer weights, but in this case there isn't a single row corresponding to our image because the input isn't one-hot. Do we take the embedding-layer activations instead of the actual weights?
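For reference, here is a minimal PyTorch sketch of what such a dense-input skip-gram model could look like; the layer names, sizes, and the number of context classes are hypothetical, not the paper's exact setup. The point is that the image's embedding is the activation of the hidden layer, not a row of its weight matrix:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 4096-dim VGG fc features, 128-dim style embedding,
# and some number of context items the skip-gram objective predicts.
feat_dim, embed_dim, n_context = 4096, 128, 10000

class DenseSkipGram(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim, bias=False)  # "hidden" embedding layer W
        self.out = nn.Linear(embed_dim, n_context)               # context prediction head

    def forward(self, vgg_feats):
        z = self.embed(vgg_feats)   # image embedding: the hidden-layer activation
        return self.out(z)          # logits over context items

model = DenseSkipGram()
v = torch.randn(1, feat_dim)        # stand-in for one image's VGG feature vector
embedding = model.embed(v)          # after training, this activation is the image embedding
```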

Best Answer

I think one-hot encoding is just a special representation of a word; it is not necessary. If you take an image representation $v$ from the last fully connected layer of the VGG network, and $W$ is the weight matrix of the embedding (hidden) layer, the product $v \cdot W$ gives you the embedding code.
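In code, this amounts to a single matrix product. A numpy sketch with stand-in values, assuming $v$ is a row vector and $W$ maps the feature dimension to the embedding dimension:

```python
import numpy as np

feat_dim, embed_dim = 4096, 128
W = np.random.rand(feat_dim, embed_dim)  # trained hidden-layer weights (stand-in values)
v = np.random.rand(feat_dim)             # VGG fc-layer feature vector for one image

embedding = v @ W                        # the image's embedding code
```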
