Solved – How to incorporate metadata into a text classification model

machine learning, natural language, text mining

I looked at "How it's better to include non-word features into text classification model?" but there aren't any useful answers there.

I have a possibly naive question: I'd like to incorporate metadata into a text classification model. However, I'm not sure how to proceed.

Assume that I have a dataset that is $N \times 3$, where the columns are:

  1. text document – for example, an Amazon review or newspaper article
  2. some meta_data – for example, the number of words longer than 5 characters, or the time the article was published
  3. category – either A, B or C

The goal is to use the text document and the meta_data to classify each example into the correct category.

Typically one would perform text classification on the text document alone: preprocess it (tokenize, lemmatize, remove stopwords, etc.) and build a sparse matrix of word counts. A model (an SVM, for example, is a popular choice) would then be trained on this sparse matrix and used to classify unseen documents as A, B or C.
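
For concreteness, here is a minimal sketch of that text-only pipeline using scikit-learn; the tiny inline corpus and the variable names are placeholders I'm introducing for illustration, not part of the question.

```python
# Minimal sketch of the usual text-only pipeline (assumes scikit-learn is installed).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder data standing in for the N x 3 dataset described above.
df = pd.DataFrame({
    "text": [
        "great product, arrived quickly and works well",
        "terrible quality, broke after a single day",
        "the article covers the outcome of the recent election",
    ],
    "category": ["A", "B", "C"],
})

# CountVectorizer handles tokenization, stop-word removal and the sparse
# word-count matrix; LinearSVC is trained on that sparse matrix.
model = make_pipeline(
    CountVectorizer(stop_words="english"),
    LinearSVC(),
)
model.fit(df["text"], df["category"])

# Classify an unseen document as A, B or C.
print(model.predict(["fast shipping and solid build quality"]))
```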

But what about the metadata? I'd like to incorporate it somehow, but in this paradigm it's unclear to me where I can inject it. I feel like what I want is a model of the form:

$y = \beta_0X_0 + \beta_1X_1$

where $X_0$ is the metadata and $X_1$ is the result of the NLP part. But how would I set up such a model? Can I reduce the text classification portion to a single coefficient? Or am I conflating two distinct approaches to modeling text?

Best Answer

Just use the metadata features as features for the SVM.

Typically the features that you feed into the SVM would be the $n \times k$ document-term matrix $\mathbf{T}$. You also have the $n \times j$ matrix of metadata features $\mathbf{M}$ (not including the category). So you want to give your SVM algorithm the combined $n \times (k + j)$ matrix $$\begin{pmatrix}\mathbf{T} & \mathbf{M}\end{pmatrix}$$
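
Here is a minimal sketch of that column-wise concatenation with scikit-learn and SciPy; the toy documents and the single metadata column are placeholders of my own (e.g. a count of long words), not anything prescribed by the answer.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = [
    "great product, arrived quickly and works well",
    "terrible quality, broke after a single day",
    "the article covers the outcome of the recent election",
]
labels = ["A", "B", "C"]
meta = np.array([[1.0], [1.0], [3.0]])      # n x j metadata matrix M (e.g. count of words longer than 5)

vectorizer = CountVectorizer(stop_words="english")
T = vectorizer.fit_transform(docs)          # n x k sparse document-term matrix T

X = hstack([T, csr_matrix(meta)])           # n x (k + j) combined matrix (T | M)
clf = LinearSVC().fit(X, labels)

# At prediction time, build the same (T | M) layout for the new document.
new_doc = ["fast shipping and solid build quality"]
new_meta = csr_matrix([[2.0]])
print(clf.predict(hstack([vectorizer.transform(new_doc), new_meta])))
```

In practice the metadata columns usually benefit from scaling (e.g. with StandardScaler) before being concatenated with word counts, since SVMs are sensitive to differences in feature magnitude.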