Solved – Performing Word Embeddings with domain-specific data

classification · python · word embeddings · word2vec

I am new to word embeddings and have only worked with older approaches like bag of words/tf-idf. Unlike tf-idf or bag of words, word embeddings require me to first train a model before I can embed anything.

If working with domain-specific documents, can I train word2vec on those documents and then produce the word embeddings with the trained model instead of an easily available, pre-trained model? If not, what are suggestions for embedding domain-specific data for classification, since similar data may not be readily available?

Best Answer

Yes, you can use your own corpus entirely when creating word embeddings. This is the path you want to pursue if your corpus contains words that would be out-of-vocabulary for pretrained embeddings, or if the corpus used to create those pretrained embeddings is ill-suited to your problem at hand.
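As a minimal sketch of that, here is how training on your own documents might look with gensim's Word2Vec (gensim 4.x API). The corpus below is a made-up placeholder; in practice you would pass your own tokenized domain documents and tune the hyperparameters to your data.

```python
from gensim.models import Word2Vec

# Hypothetical domain corpus: one tokenized document per inner list.
corpus = [
    ["patient", "presented", "with", "acute", "myocardial", "infarction"],
    ["administered", "thrombolytic", "therapy", "on", "admission"],
    # ... more tokenized domain documents
]

# Train word2vec directly on the domain corpus.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the embeddings
    window=5,         # context window size
    min_count=1,      # keep rare tokens on a tiny corpus; raise this on real data
    sg=1,             # 1 = skip-gram, 0 = CBOW
    workers=4,
    epochs=20,        # extra passes tend to help on small corpora
)

vector = model.wv["patient"]                       # embedding for a domain term
similar = model.wv.most_similar("patient", topn=5)  # nearest neighbors in the learned space
```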

If you did want to combine pretrained embeddings with your additional vocabulary, there are ways to do that. One example would be training a model on your limited corpus to get vectors for the words that are out of vocabulary and concatenating them onto the pretrained set, roughly as in the sketch below. I haven't tried it and can't speak to how successful/useful that approach is, but it's an option.
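A rough sketch of that combination with gensim 4.x follows. The pretrained file name is just an assumption (substitute whatever pretrained vectors you use), and note that vectors trained in two separate runs live in different spaces, so the caveat above about usefulness applies.

```python
import numpy as np
from gensim.models import Word2Vec, KeyedVectors

# Assumed pretrained file; swap in whichever pretrained vectors you actually use.
pretrained = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Tiny placeholder domain corpus (same tokenized-list format as before).
domain_corpus = [
    ["patient", "presented", "with", "acute", "myocardial", "infarction"],
    ["administered", "thrombolytic", "therapy", "on", "admission"],
]

# Train on the domain corpus with the *same* dimensionality as the pretrained set.
domain_model = Word2Vec(sentences=domain_corpus, vector_size=300, min_count=1, epochs=20)

# Collect vectors for words the pretrained set doesn't know about.
oov_words = [w for w in domain_model.wv.index_to_key if w not in pretrained]
oov_vectors = np.array([domain_model.wv[w] for w in oov_words])

# Append the new vectors to the pretrained KeyedVectors.
pretrained.add_vectors(oov_words, oov_vectors)
```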

I've only used gensim in the past to train word2vec models, so I can't speak to other libraries that have implemented some flavor of word embedding. Here's the doc for gensim's word2vec implementation, which gives you a lot of levers to pull in terms of how your vectors are generated.
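Since your end goal is classification, one common way to go from word vectors to document features is to average the vectors of each document's in-vocabulary tokens. Continuing from the training sketch above (`model` and `corpus` are the names used there), a minimal version might look like this; the labels are a hypothetical placeholder for your own class labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def document_vector(tokens, wv):
    """Average the word vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)


# Fixed-length feature vector per document, stacked into a matrix.
X = np.vstack([document_vector(doc, model.wv) for doc in corpus])

# Placeholder labels, one per document in the toy corpus above.
labels = [0, 1]

clf = LogisticRegression(max_iter=1000).fit(X, labels)
```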