Machine Learning – Multi-Class Classification with Word2Vec

classification · machine-learning · natural-language · word2vec

My problem: the input data is a corpus of short documents (a few sentences each). In each document, certain expressions need to be classified into categories. Some categories must appear in every document, while the rest are optional; each expression carries a single label. The task: given such an expression and its surrounding words, classify the expression's category.

As a solution, I thought of converting my vocabulary words to vectors using word2vec and then applying some multi-class classifier.

Is there a classifier that is a particularly good fit for word2vec's output? I was thinking of using an SVM; is there a recommended kernel?
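A minimal sketch of what I have in mind, with a tiny hand-made word-to-vector dictionary standing in for a trained word2vec model and made-up category labels (in practice the vectors would come from something like gensim):

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for a trained word2vec model: word -> vector.
# Real vectors would come from e.g. gensim's Word2Vec / KeyedVectors.
vectors = {
    "deep": np.array([0.9, 0.1]),
    "learning": np.array([0.8, 0.2]),
    "pasta": np.array([0.1, 0.9]),
    "sauce": np.array([0.2, 0.8]),
}

def featurize(words):
    """Average the vectors of the expression and its surrounding words."""
    vecs = [vectors[w] for w in words if w in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# Made-up training examples: (expression + context words, category label)
X = np.array([
    featurize(["deep", "learning"]),
    featurize(["learning"]),
    featurize(["pasta", "sauce"]),
    featurize(["sauce"]),
])
y = ["tech", "tech", "food", "food"]

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([featurize(["sauce", "pasta"])]))
```

Averaging the context vectors is the simplest pooling choice; whether it preserves enough signal for my categories is exactly what I am unsure about.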

Best Answer

It is always hard to assess a priori how a pre-processing step will affect performance. Even something as simple as normalizing the data does not have an obvious influence on the classifiers trained afterwards (see, for example, this post: Normalizing data worsens the performance of CNN?).

However, the following links may help you implement your idea:

Text Classification With Word2Vec: the author assesses the performance of various classifiers on text documents, with and without a word2vec embedding. It turns out that the best performance is attained by a "classical" linear support vector classifier with a TF-IDF encoding (the post is really helpful in terms of code, especially if you work with Python and scikit-learn).
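That winning combination (TF-IDF features plus a linear SVC) is a few lines in scikit-learn. A minimal sketch with made-up documents and labels; in your setting each sample would be an expression plus its surrounding words:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Illustrative toy corpus; your samples would be expression + context.
docs = ["the model trains fast", "gradient descent converges quickly",
        "boil the pasta well", "simmer the tomato sauce slowly"]
labels = ["tech", "tech", "food", "food"]

# TF-IDF encoding followed by a linear support vector classifier
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["simmer the sauce"]))
```

LinearSVC handles the multi-class case out of the box (one-vs-rest), so nothing extra is needed for more than two categories.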

Regarding SVMs, there are kernels designed specifically for text. I once had good results with information diffusion kernels on a TF-IDF encoding. There are also kernels that work directly on strings (Text Classification using String Kernels), but their implementations are scarcer.
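scikit-learn's SVC accepts a custom kernel as a callable, so the information diffusion kernel is straightforward to plug in. A sketch under my own assumptions (the toy corpus, the diffusion time t=1.0, and using L1-normalised TF-IDF rows as points on the multinomial simplex are all my choices):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer

def diffusion_kernel(X, Y, t=1.0):
    # Information diffusion kernel on the multinomial simplex:
    #   K(x, y) = exp(-arccos(sum_i sqrt(x_i * y_i))**2 / t)
    # Rows of X and Y must be non-negative and L1-normalised.
    bc = np.sqrt(X) @ np.sqrt(Y).T        # Bhattacharyya coefficient
    d = np.arccos(np.clip(bc, 0.0, 1.0))  # geodesic distance on the simplex
    return np.exp(-d ** 2 / t)

docs = ["the model trains fast", "gradient descent converges quickly",
        "boil the pasta well", "simmer the tomato sauce"]
labels = ["tech", "tech", "food", "food"]

# norm="l1" makes each TF-IDF row sum to 1, i.e. a point on the simplex
vec = TfidfVectorizer(norm="l1")
X = vec.fit_transform(docs).toarray()

clf = SVC(kernel=diffusion_kernel).fit(X, labels)
print(clf.predict(vec.transform(["simmer the sauce"]).toarray()))
```

The callable receives two row matrices and must return the Gram matrix between them, which is why the kernel is written in matrix form rather than pairwise.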