Solved – Understanding how word embedding with fastText works in this case

machine learning, natural language, neural networks, word embeddings

I'm looking for some guidance with fastText and NLP to help me understand how the model calculates the vector of a sentence.

Context:

I'm using the fastText method get_sentence_vector() to calculate the vector of a query sentence, which I will call P1, as well as the vectors of a set of sentences (P2, P3, P4, P5, …, Pn). Sentences can have one or more words. Then I calculate the distance between the vector of P1 and that of each of the other sentences to obtain the list of sentences closest to P1. Please note that I'm doing preprocessing only on P1 (removal of numbers and punctuation, plus tokenization and lemmatization with SpaCy). The goal is to get the sentences that come closest in terms of meaning.
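
A simplified sketch of this setup (the model path is a placeholder, the SpaCy preprocessing of P1 is omitted, and cosine distance is used here just as an example distance):

    import fasttext
    from scipy.spatial.distance import cosine

    # Placeholder path to a pretrained fastText model (.bin)
    model = fasttext.load_model("cc.en.300.bin")

    p1 = "biofertilizers"                                      # query sentence P1
    candidates = ["chemical fertilizers", "bio-fertilizers"]   # P2 ... Pn

    v1 = model.get_sentence_vector(p1)
    for p in candidates:
        # cosine distance: 0 means identical direction, larger means less similar
        print(p, cosine(v1, model.get_sentence_vector(p)))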


The problem is that I do not understand the results I get for different cases:

case 1: P1 = "biofertilizers"

  • distance between vectors "biofertilizers" and "chemical fertilizers" : 0.48
  • distance between vectors "biofertilizers" and "bio-fertilizers" : 0.49

Here, I don't understand how the vector calculated with fastText for "biofertilizers" is closer to "chemical fertilizers" than to "bio-fertilizers". Is the dash counted during the vector calculation? "Bio-fertilizers" should logically be closer, don't you think?

case 2: P1 = "laptop"

  • distance between vectors "laptop" and "battery chargers for laptop computers" : 0.16
  • distance between vectors "laptop" and "tablet computers" : 0.27

This is not correct, because "tablet computers" is closer to "laptop" in terms of meaning than "battery chargers for laptop computers". Is it because the latter contains the word "laptop" that the distance is lower?

case 3: P1 = "knives":

The distance between "knives" and "tableware, except forks, knives and spoons" is low, and these two sentences are considered close. This should not be the case because their meanings are opposed. So I assume that fastText does not take negation words like 'except' or 'not' into account during the vector calculation?


How does Fasttext arrive at these results when calculating distances between vectors?

I am also interested in hearing other suggestions for calculating the degree of semantic proximity between two sentences.

Best Answer

Your questions mostly concern the implementation of fastText rather than the underlying statistical concepts. I couldn't find clear documentation on how sentence embeddings are calculated from the word embeddings, but looking at the C++ code provided some hints.

  1. The doc string for get_sentence_vector [here](https://github.com/facebookresearch/fastText/blob/2bef44b36fe6fb900e20fdb5828e7602484a5d29/python/fasttext_module/fasttext/FastText.py) says:

    Given a string, get a single vector representation. This function
    assumes to be given a single line of text. We split words on
    whitespace (space, newline, tab, vertical tab) and the control
    characters carriage return, formfeed and the null character.
    

    I understand this to mean that hyphens are not split, so a hyphenated word is treated as a single token (and its character n-grams, which fastText also uses, include the hyphen); see the first sketch after this list.

  2. Looking at getSentenceVector in the C++ source, I think that the vector for each word is divided by its Euclidean norm and the normalized vectors are then averaged; see the second sketch after this list. I suspect this method overemphasizes the occurrence of 'laptop' in your example.

  3. No; as we know from 2., fastText simply averages the meanings of the individual words, so word order and words like 'except' or 'not' carry no special weight. If you want to 'understand' the sentence including negations, then actual language models that model not only the words but the sentence as a structure will be more helpful.
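
To make point 1 concrete, you can ask the model how it decomposes a token into character n-grams. This is only a sketch (the model path is a placeholder for whichever pretrained .bin file you use), but it shows that the hyphen stays inside the token and therefore inside its n-grams, so "bio-fertilizers" and "biofertilizers" end up with partly different subword sets:

    import fasttext

    model = fasttext.load_model("cc.en.300.bin")  # placeholder path

    # get_subwords returns the token itself (if it is in the vocabulary) plus its
    # character n-grams; note that the hyphen appears inside the n-grams.
    for token in ["biofertilizers", "bio-fertilizers"]:
        subwords, _ = model.get_subwords(token)
        print(token, subwords[:10])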
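
And for point 2, here is a sketch of my reading of the C++ code (not the library's own implementation) of what getSentenceVector appears to do for unsupervised (cbow/skipgram) models, which you can compare against get_sentence_vector directly:

    import numpy as np
    import fasttext

    model = fasttext.load_model("cc.en.300.bin")  # placeholder path

    def sentence_vector(model, sentence):
        """Average of the L2-normalized word vectors of the sentence."""
        vectors = []
        for word in sentence.split():
            v = model.get_word_vector(word)
            norm = np.linalg.norm(v)
            if norm > 0:
                vectors.append(v / norm)
        if not vectors:
            return np.zeros(model.get_dimension())
        return np.mean(vectors, axis=0)

    s = "battery chargers for laptop computers"
    print(np.allclose(sentence_vector(model, s), model.get_sentence_vector(s), atol=1e-5))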

For example, look at recurrent neural network architectures or transformer architectures. Recent language models relax the bag-of-words assumption and process the words in order. RNNs will process the embedded word input one word at a time and aggregate the processed content into a latent state. At some point, the hidden state is typically passed on to a fully connected layer to make a prediction used for training. However, you could load well-optimized weights, use the recurrent layers to calculate the latent state for a phrase while ignoring the later layers, and then compare the hidden states of phrases/sentences.
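
As a toy illustration of that idea (untrained, randomly initialized weights and a made-up vocabulary, purely to show the mechanics rather than to produce meaningful similarities), you would embed the words, run them through a recurrent layer, keep the final hidden state, and compare those states:

    import torch
    import torch.nn.functional as F

    # Toy vocabulary and encoder; in practice you would load pretrained weights
    # (e.g. from a language model) instead of using random initialization.
    vocab = {"laptop": 0, "tablet": 1, "computers": 2, "battery": 3, "chargers": 4, "for": 5}
    embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)
    rnn = torch.nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

    def encode(sentence):
        """Run the sentence through the RNN and return the final hidden state."""
        ids = torch.tensor([[vocab[w] for w in sentence.split()]])
        _, (h_n, _) = rnn(embedding(ids))
        return h_n[-1, 0]  # hidden state after the last word

    h1 = encode("laptop")
    h2 = encode("battery chargers for laptop computers")
    print(F.cosine_similarity(h1, h2, dim=0).item())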

Recent architectures for which pre-trained weights and tutorials are available include ULMFiT and ELMo. For both of these models, common use would be not to train the model on your data at all (although fine-tuning is possible), but to download a set of weights trained on a large corpus.
