MATLAB: FastText word embedding support package

fasttext, nlp, text analytics

I'm using the Text Analytics Toolbox and the pretrained fastText word embedding support package. Is it possible to add additional words to the pretrained vocabulary?

Best Answer

To answer my own question, the following example code shows how to add words to the embedding vocabulary. This requires a new embedding object to be created.

>> emb = fastTextWordEmbedding; 
>> vocab = emb.Vocabulary; 
>> mat = word2vec(emb, vocab);
>> newvocab = [vocab "New Word 1" "New Word 2"]; 
>> newmat = [mat; randn(2,300)]; 
>> newemb = wordEmbedding(newvocab, newmat);
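
As a quick sanity check, the same pattern can be tried on a toy embedding and then queried. This is only a sketch: the two-word vocabulary and its 3-dimensional vectors are made-up stand-ins for the real fastText data, and it assumes the Text Analytics Toolbox functions isVocabularyWord and word2vec are available.

```matlab
% Toy stand-ins for emb.Vocabulary and word2vec(emb, vocab) (made-up values)
vocab = ["king" "queen"];
mat = [0.1 0.2 0.3; 0.4 0.5 0.6];      % one row per word

% Append one new word with a random vector of matching dimension
newvocab = [vocab "New Word 1"];
newmat = [mat; randn(1, size(mat, 2))];
newemb = wordEmbedding(newvocab, newmat);

% Confirm the new word is known and retrieve its vector
tf = isVocabularyWord(newemb, ["New Word 1" "missing"]);
vec = word2vec(newemb, "New Word 1");
```

Using size(mat, 2) instead of a hard-coded 300 keeps the new rows consistent with whatever dimension the original embedding has.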

In addition, I have confirmed that it is possible to use the fastText pretrained vectors with 2 million words (600 billion tokens) rather than the default 1 million words (16 billion tokens) provided with the MATLAB fastTextWordEmbedding function.

To do this, replace the "wiki-news-300d-1M.vec.zip" file with the alternative pre-trained word vectors file from https://fasttext.cc/docs/en/english-vectors.html
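
A different route, which avoids replacing the support package file, may be to read the downloaded vectors directly with readWordEmbedding. This is a sketch of that alternative, not the method described above: the file name crawl-300d-2M.vec is an assumption about what the download extracts to, and the file is assumed to be in the current folder.

```matlab
% Assumed name of the extracted file downloaded from fasttext.cc
vecFile = "crawl-300d-2M.vec";

if isfile(vecFile)
    % Reads a word2vec/GloVe style text file into a wordEmbedding object
    emb2M = readWordEmbedding(vecFile);
    fprintf("Loaded %d words of dimension %d\n", ...
        numel(emb2M.Vocabulary), emb2M.Dimension);
else
    warning("File %s not found; download and extract it first.", vecFile);
end
```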

Related Solutions

MATLAB: Increasing vocabulary of pre-trained word embeddings

Yes. To add more words to the existing vocabulary given by 'fastTextWordEmbedding', you can try the following:

1. Obtain the wordEmbedding object for 'fastTextWordEmbedding':

>> emb = fastTextWordEmbedding;

2. Obtain the vocabulary from the wordEmbedding object:

>> vocab = emb.Vocabulary;

3. Add more words to the string array, for example:

>> vocab(end+1) = 'Hi';
>> vocab(end+1) = 'Hello';

4. Write each word and its vector to a text file with UTF-8 encoding in either the word2vec or GloVe text embedding format, or to a zip file containing a text file of this format. You can use fopen, fprintf and fclose for this step:

www.mathworks.com/help/matlab/ref/fopen.html

www.mathworks.com/help/matlab/ref/fprintf.html

www.mathworks.com/help/matlab/ref/fclose.html

5. Use 'readWordEmbedding' to read this text file with the additional words and get a new word embedding object. The doc page for 'readWordEmbedding' explains why the file needs to be in the above format.
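
Steps 4 and 5 could look roughly like the sketch below. The words, vectors and the file name customEmbedding.vec are made-up placeholders; in practice the rows would come from the existing embedding plus whatever vectors you choose for the new words. It assumes the word2vec text layout of a header line with the word count and dimension, followed by one line per word. Because that layout separates fields with whitespace, words containing spaces would probably not survive the round trip.

```matlab
% Toy data standing in for the real vocabulary and vectors (made-up values)
words = ["hi" "hello" "world"];
vectors = [0.1 0.2 0.3; 0.4 0.5 0.6; 0.7 0.8 0.9];   % one row per word

filename = "customEmbedding.vec";                    % hypothetical file name
fid = fopen(filename, "w", "n", "UTF-8");
assert(fid ~= -1, "Could not open %s for writing.", filename);

% Header line: <number of words> <dimension>
fprintf(fid, "%d %d\n", size(vectors, 1), size(vectors, 2));

% One line per word: the word followed by its vector components
rowFormat = ['%s' repmat(' %.8g', 1, size(vectors, 2)) '\n'];
for i = 1:numel(words)
    fprintf(fid, rowFormat, words(i), vectors(i, :));
end
fclose(fid);

% Step 5 (requires Text Analytics Toolbox):
% newemb = readWordEmbedding(filename);
```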

MATLAB: Import pre-trained word embeddings (GloVe, Skipgram, etc.) in Deep Neural Network models.

You can use a pre-trained embedding model to initialize the Weights property of the wordEmbeddingLayer. For example:

% Import your pretrained word embedding model of choice
emb = readWordEmbedding('existingEmbeddingModel.vec');
embDim = emb.Dimension; 
numWords = numel(emb.Vocabulary); 
 
% Initialize the word embedding layer 
embLayer = wordEmbeddingLayer(embDim, numWords); 
embLayer.Weights = word2vec(emb, emb.Vocabulary)';  
 
% To keep the original weights "frozen", uncomment the following line
% embLayer.WeightLearnRateFactor = 0;

The wordEmbeddingLayer with initialized Weights can then be placed in the network before lstmLayer.
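
A layer array with the embedding layer ahead of the LSTM might be assembled as in the sketch below. The sizes, numHiddenUnits and numClasses are illustrative assumptions, the random weight matrix merely stands in for word2vec(emb, emb.Vocabulary)', and the final classificationLayer assumes the trainNetwork style of workflow.

```matlab
% Illustrative sizes (assumptions, not taken from a real embedding)
embDim = 50;
numWords = 1000;
numHiddenUnits = 64;
numClasses = 3;

% Embedding layer with preset weights; randn stands in for the
% pretrained matrix word2vec(emb, emb.Vocabulary)'
embLayer = wordEmbeddingLayer(embDim, numWords);
embLayer.Weights = randn(embDim, numWords);

% Word indices go in, the embedding layer maps them to vectors,
% and the LSTM consumes the resulting sequence
layers = [
    sequenceInputLayer(1)
    embLayer
    lstmLayer(numHiddenUnits, 'OutputMode', 'last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
```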

Also note that training documents should be mapped according to the vocabulary of the pre-trained embedding model, before passing to the net for training, for example:

enc = wordEncoding(tokenizedDocument(emb.Vocabulary,'TokenizeMethod','none'));
XTrain = doc2sequence(enc,documentsTrain,'Length',75);
