Solved – Mix of text and numeric data

classification, data mining, machine learning, text mining

I have to train a classification model with 15 classes on data that contains both textual and numeric fields, for instance a product description (textual) and a product length (numeric). I have experience with text mining, but only on purely textual data. My approach would be to separate the textual and numeric data, create a document-feature matrix (dfm) from the text, and then merge it with the numeric data. I am open to better approaches.

Best Answer

You have two main options here:

  1. As you said, create numeric features from the text description and merge them with the rest of the numeric data. The text features can be the document-term matrix (with or without tf-idf weighting), SVD components of that matrix, or averaged word vectors (look up word2vec and similar methods).

  2. You can build two separate classifiers (one using only the text data and one using only the numeric data) and then combine their outputs with a meta-model (stacking), i.e. a second-level model trained on the predictions of the first two.
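
To make the first option concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the toy rows, the labels, and the helper names (`fit_tfidf`, `tfidf_vector`, `fit_scaler`, `featurize`, `predict`) are all hypothetical, the tf-idf variant is one common smoothed form, and the 1-nearest-neighbour classifier is just a stand-in for whatever model you actually train on the merged features.

```python
import math
from collections import Counter

# Hypothetical toy data: (product description, product length, class label).
rows = [
    ("red cotton shirt", 70.0, "clothing"),
    ("blue cotton shirt", 72.0, "clothing"),
    ("steel kitchen knife", 30.0, "kitchen"),
    ("steel bread knife", 33.0, "kitchen"),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def fit_tfidf(docs):
    """Learn the vocabulary and smoothed idf weights from tokenized documents."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf

def tfidf_vector(doc, vocab, idf):
    """One L2-normalised row of the document-term matrix; unseen tokens are ignored."""
    counts = Counter(doc)
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def fit_scaler(values):
    """Mean and standard deviation used to standardise the numeric column."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return mean, std

# Fit the text and numeric transformations on the training rows.
vocab, idf = fit_tfidf([tokenize(desc) for desc, _, _ in rows])
mean, std = fit_scaler([length for _, length, _ in rows])

def featurize(description, length):
    """Merge step: tf-idf text features followed by the scaled numeric feature."""
    text_part = tfidf_vector(tokenize(description), vocab, idf)
    numeric_part = [(length - mean) / std]
    return text_part + numeric_part

X = [featurize(desc, length) for desc, length, _ in rows]
y = [label for _, _, label in rows]

def predict(description, length):
    """Stand-in classifier: label of the nearest training row (squared Euclidean)."""
    query = featurize(description, length)
    dists = [sum((a - b) ** 2 for a, b in zip(query, row)) for row in X]
    return y[dists.index(min(dists))]

print(predict("green cotton shirt", 69.0))
```

Scaling the numeric column matters here because the tf-idf rows are normalised to unit length, so an unscaled length in the tens would dominate any distance-based or regularised model. In practice you would probably not hand-roll this; in scikit-learn, for example, a `ColumnTransformer` combining `TfidfVectorizer` and `StandardScaler` performs the same merge.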