Solved – Performing Word Embeddings with domain-specific data

classification · python · word embeddings · word2vec

I am new to word embeddings and have only worked with older approaches like bag of words/tf-idf. Unlike tf-idf or bag of words, word embeddings require me to first train a model before I can embed anything.

If working with domain-specific documents, can I train word2vec on those documents and then produce the word embeddings with the trained model instead of an easily available, pre-trained model? If not, what are suggestions for embedding domain-specific data for classification, since similar data may not be readily available?

Best Answer

Yes, you can use your own corpus entirely when creating word embeddings. This is the path you want to pursue if your corpus contains words that would be out-of-vocabulary for pretrained embeddings, or if the corpus used to create those pretrained embeddings is ill-suited to your problem at hand.
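As a minimal sketch of that, here is how training on your own documents might look with gensim's Word2Vec (gensim 4.x API). The corpus below is a made-up placeholder; in practice you would pass your own tokenized domain documents and tune the hyperparameters to your data.

```python
from gensim.models import Word2Vec

# Hypothetical domain corpus: one tokenized document per inner list.
corpus = [
    ["patient", "presented", "with", "acute", "myocardial", "infarction"],
    ["administered", "thrombolytic", "therapy", "on", "admission"],
    # ... more tokenized domain documents
]

# Train word2vec directly on the domain corpus.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the embeddings
    window=5,         # context window size
    min_count=1,      # keep rare tokens on a tiny corpus; raise this on real data
    sg=1,             # 1 = skip-gram, 0 = CBOW
    workers=4,
    epochs=20,        # extra passes tend to help on small corpora
)

vector = model.wv["patient"]                       # embedding for a domain term
similar = model.wv.most_similar("patient", topn=5)  # nearest neighbors in the learned space
```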

If you did want to combine pretrained embeddings with your additional vocabulary, there are ways to do that. One example would be training a model on your limited corpus to get vectors for the words that are out of vocabulary and concatenating them onto the pretrained set, roughly as in the sketch below. I haven't tried it and can't speak to how successful/useful that approach is, but it's an option.
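A rough sketch of that combination with gensim 4.x follows. The pretrained file name is just an assumption (substitute whatever pretrained vectors you use), and note that vectors trained in two separate runs live in different spaces, so the caveat above about usefulness applies.

```python
import numpy as np
from gensim.models import Word2Vec, KeyedVectors

# Assumed pretrained file; swap in whichever pretrained vectors you actually use.
pretrained = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Tiny placeholder domain corpus (same tokenized-list format as before).
domain_corpus = [
    ["patient", "presented", "with", "acute", "myocardial", "infarction"],
    ["administered", "thrombolytic", "therapy", "on", "admission"],
]

# Train on the domain corpus with the *same* dimensionality as the pretrained set.
domain_model = Word2Vec(sentences=domain_corpus, vector_size=300, min_count=1, epochs=20)

# Collect vectors for words the pretrained set doesn't know about.
oov_words = [w for w in domain_model.wv.index_to_key if w not in pretrained]
oov_vectors = np.array([domain_model.wv[w] for w in oov_words])

# Append the new vectors to the pretrained KeyedVectors.
pretrained.add_vectors(oov_words, oov_vectors)
```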

I've only used gensim in the past to train word2vec models, so I can't speak to other libraries that have implemented some flavor of word embedding. Here's the doc for gensim's word2vec implementation, which gives you a lot of levers to pull in terms of how your vectors are generated.
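Since your end goal is classification, one common way to go from word vectors to document features is to average the vectors of each document's in-vocabulary tokens. Continuing from the training sketch above (`model` and `corpus` are the names used there), a minimal version might look like this; the labels are a hypothetical placeholder for your own class labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def document_vector(tokens, wv):
    """Average the word vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)


# Fixed-length feature vector per document, stacked into a matrix.
X = np.vstack([document_vector(doc, model.wv) for doc in corpus])

# Placeholder labels, one per document in the toy corpus above.
labels = [0, 1]

clf = LogisticRegression(max_iter=1000).fit(X, labels)
```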