Solved – a multimodal embedding

computer vision, conv-neural-network, deep learning, terminology

I don't have a computer vision background, yet when I read articles and papers on image processing and convolutional neural networks, I constantly come across the term multimodal embedding. Can anybody provide some intuition behind this term?

I usually encounter this term in image captioning and image recognition.

Best Answer

The link shows what multimodal embeddings are. Multimodal refers to an admixture of media, e.g., a picture of a banana with text that says "This is a banana." Embedding means what it always does in math: something inside something else. A figure consisting of an embedded picture of a banana with an embedded caption that reads "This is a banana." is a multimodal embedding.

Edit for @Herbert: From this:

"In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables."

Elsewhere, one finds this:

"An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models."
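To make the quoted neural network definition a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the vectors are hand-picked rather than learned, the keys (including the `image:banana_photo` entry standing in for a picture) are hypothetical, and `cosine_similarity` and `nearest` are helper names made up for this example. It shows discrete items mapped to short continuous vectors, with semantically similar items placed close together.

```python
import math

# Hypothetical, hand-picked 3-D vectors. A real model would learn these
# values during training; they are fixed here only for illustration.
EMBEDDINGS = {
    "text:banana": [0.9, 0.1, 0.0],
    "text:apple": [0.8, 0.3, 0.1],
    "text:car": [0.0, 0.2, 0.9],
    # Assumed entry standing in for an image of a banana, placed in the
    # same vector space as the text entries.
    "image:banana_photo": [0.85, 0.15, 0.05],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(key, table=EMBEDDINGS):
    """Return the other key whose vector is most similar to table[key]."""
    query = table[key]
    others = (k for k in table if k != key)
    return max(others, key=lambda k: cosine_similarity(query, table[k]))


if __name__ == "__main__":
    print(nearest("image:banana_photo"))
```

With these particular numbers, the entry standing in for the banana picture ends up closest to the text entry for "banana" and far from "car", which is the "semantically similar inputs close together" property from the quotation.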

In terms of what embedding usually means, the neural network definition of embedding seems to me to be particular to that field. That is, it has some of the characteristic features of an embedding in the larger sense, but is more figurative than exact.

In general, the word embedded can be used somewhat figuratively or "metaphorically." For example, one dictionary definition is:

verb (used with object), em·bed·ded, em·bed·ding.

  1. to fix into a surrounding mass: to embed stones in cement.
  2. to surround tightly or firmly; envelop or enclose: Thick cotton padding embedded the precious vase in its box.
  3. to incorporate or contain as an essential part or characteristic: A love of color is embedded in all of her paintings.

I am not a grammarian, but it seems to me that the third definition above is a figure of speech, a hyperbole, a metaphor, and is inexact. Although such things are common linguistically, they are not literal, and in that sense, the usage of the word embedded for neural networks is somewhat jargonesque.