Solved – combining text and non-text features in a classification model

feature-selection, feature-engineering, machine-learning

I am new to ML, so please interpret this question accordingly. I am not sure whether this is a common issue, or whether I am thinking about it the right way.

Here is what I am trying to do:

I have a bunch of text fragments which I want to classify into certain topics. The text fragments are the titles of support tickets, so for example the title "My laptop is broken, please help" might get classified into the Hardware category, and the title "I would like a refund for my July bill" might get classified into the Finance category.

So far this is straightforward. However, I have a lot of metadata that would probably be useful to include in my model. For example, I know how long somebody has been a customer with the company, which could be one feature. I know the age of each customer, which could be another feature. Etc.

What I'm not sure about is the best way to combine these metadata features with the text features. For the text features I am using something like tf-idf, so I'll have one feature per word in the vocabulary, and the feature vector will be very long since the vocabulary is large. I suppose I could manually append the metadata features to the end of that vector, but it seems a little ridiculous to append 10 features to a feature vector that is 100k features long, and I'm not sure it would work correctly.

FWIW I am using scikit-learn, but I'm not sure if it has any functionality that would help here.
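For what it's worth, the "append to the end of the vector" idea from the question can be sketched directly in scikit-learn with `scipy.sparse.hstack`, which keeps the combined matrix sparse. The ticket titles and metadata columns (tenure in years, age) below are made up for illustration:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "My laptop is broken, please help",
    "I would like a refund for my July bill",
]
# Hypothetical metadata: [tenure_years, age] per ticket.
metadata = np.array([[2.0, 34.0], [7.0, 51.0]])

tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(titles)        # sparse, shape (2, vocab_size)

# Append the dense metadata columns to the sparse tf-idf matrix.
X = hstack([X_text, csr_matrix(metadata)])  # shape (2, vocab_size + 2)
print(X.shape)
```

One caveat with this direct concatenation: the metadata columns are on a different scale than tf-idf weights, so scaling them first (e.g. with `StandardScaler`) is usually sensible for linear models.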

Best Answer

I am not aware of a standard way as such, but here is one thing I would try. It uses a pipeline of two models:

  1. Train a model on the textual data to predict a class (like Finance or Hardware), and take that model's prediction as a categorical variable.
  2. Append that categorical variable to the existing metadata features, and train a new model on the combined features.
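The two steps above can be sketched as follows. This is a minimal illustration, not a production recipe: the tickets, labels, and metadata are made up, and the prediction is encoded as a single 0/1 column since there are only two classes here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

titles = ["My laptop is broken", "Refund for my July bill",
          "Screen will not turn on", "Billing question about invoice"]
labels = ["Hardware", "Finance", "Hardware", "Finance"]
# Hypothetical metadata: [tenure_years, age] per ticket.
metadata = np.array([[2.0, 34.0], [7.0, 51.0], [1.0, 29.0], [4.0, 45.0]])

# Step 1: text-only model; its output becomes a new feature.
vec = TfidfVectorizer()
X_text = vec.fit_transform(titles)
model1 = LogisticRegression().fit(X_text, labels)
text_pred = model1.predict(X_text)  # one categorical prediction per ticket

# Step 2: encode the prediction and append it to the metadata.
pred_code = (text_pred == "Hardware").astype(float).reshape(-1, 1)
X2 = np.hstack([metadata, pred_code])
model2 = LogisticRegression().fit(X2, labels)
```

In practice, the first model's predictions for the training rows should come from held-out data (e.g. via `cross_val_predict`) rather than from refitting on the same rows, otherwise step 2 sees leaked labels.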

Step 1 could also be modified: rather than outputting a single class (the one with the highest probability), use the whole set of per-class probabilities that the first model predicts, and append those numeric features to the metadata features for step 2.
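A sketch of that variant, with the same made-up data as above: the first model's `predict_proba` output (one column per class) is appended to the metadata, so the second model sees the text model's full confidence rather than a hard label.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

titles = ["My laptop is broken", "Refund for my July bill",
          "Screen will not turn on", "Billing question about invoice"]
labels = ["Hardware", "Finance", "Hardware", "Finance"]
metadata = np.array([[2.0, 34.0], [7.0, 51.0], [1.0, 29.0], [4.0, 45.0]])

vec = TfidfVectorizer()
model1 = LogisticRegression().fit(vec.fit_transform(titles), labels)

# Full probability vector per ticket: shape (n_tickets, n_classes).
proba = model1.predict_proba(vec.transform(titles))

# Append the probabilities to the metadata for the second model.
X2 = np.hstack([metadata, proba])
model2 = LogisticRegression().fit(X2, labels)
```

The same leakage caveat applies: on real data, generate `proba` for the training rows with cross-validation rather than from a model fit on those same rows.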
