Solved – Using word embeddings in text classifier

classification, word embeddings

I have a bunch of sentences on which I want to do binary classification with an SVM.

My sentences have varying lengths, from 4 to 34 words. If I use word embeddings such as word2vec (e.g. the skip-gram model) to convert my words into word vectors, I end up with matrices of very different sizes because of the differences in sentence length.

What is the best way to get around that? I know that if I were to use a neural network classifier, I would just pad with zeroes and let the network figure out the features. But if I were to use a classical machine learning classifier, what is the best way to deal with sentences of varying lengths?

Best Answer

Have you checked doc2vec? Doc2vec returns a fixed-size representation for each document (a sentence in your case), regardless of its length. Here is the original paper. There is also a Python implementation in the gensim package.
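
As a rough illustration of that idea (not code from the answer), here is a minimal sketch using gensim's `Doc2Vec`, assuming a recent gensim 4.x API and made-up placeholder sentences; the resulting fixed-length vectors can then be fed to an SVM:

```python
# Minimal sketch: learn fixed-size sentence vectors with gensim's Doc2Vec.
# Sentences, labels, and hyperparameters are illustrative placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    ["this", "movie", "was", "great"],
    ["terrible", "plot", "and", "acting"],
]
tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(sentences)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)  # small values for demo
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Every sentence now maps to a 50-dimensional vector, regardless of its length.
X = [model.infer_vector(toks) for toks in sentences]
```

These vectors (`X`) all have the same dimensionality, so they can be used directly as feature rows for a classical classifier such as an SVM.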

Another common approach for text classification is to compute the tf-idf matrix and use it as input to a classifier, in which case the columns of the matrix are the features. The tf-idf matrix has one row per sentence and one column per unique term (word) across all your sentences. Each element in the matrix is the tf-idf value of the corresponding sentence–term pair. You can find more information on the Wikipedia page. scikit-learn has an implementation of tf-idf here.
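
For concreteness, here is a minimal sketch (my own illustration, with placeholder sentences and labels) of the tf-idf route using scikit-learn's `TfidfVectorizer` together with a linear SVM:

```python
# Minimal sketch: tf-idf features fed into a linear SVM with scikit-learn.
# The sentences and binary labels below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

sentences = ["this movie was great", "terrible plot and acting"]
labels = [1, 0]

# TfidfVectorizer builds the sentence-by-term tf-idf matrix;
# LinearSVC then classifies each fixed-length row.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(sentences, labels)

print(clf.predict(["great acting"]))
```

Because every sentence is mapped onto the same vocabulary of columns, varying sentence lengths are no longer an issue.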