Solved – Is this a correct way to do document classification using topic modeling

I am using LDA to extract topics, and I want to use the resulting topic proportions as features for document classification.

I am considering the two approaches below, using scikit-learn. Which of the two is the right way to do multi-class document classification, where each document is labeled with exactly one of many classes?

Type 1.
1. Take the training set of documents and extract say N topics.
2. Get the topic distribution of N topics for all the labeled documents.
3. Represent each document as a vector of topic proportions. This is the feature value vector for a training document.
4. Build a multi-class SVM on the training set of feature vectors.
5. Take a test document and infer its topic proportions using the previously built topic model.
6. Feed it to SVM to get the document class prediction.
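
The six Type 1 steps could be sketched in scikit-learn roughly as follows. This is only a minimal illustration under my own assumptions: the toy documents, the labels, N = 3 topics, and the linear-kernel SVC are all made up for the example and are not part of the question.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus, invented purely for illustration.
train_docs = [
    "the team won the football match after a late goal",
    "the coach praised the players after the game",
    "bake the bread in a hot oven with flour and yeast",
    "stir the sauce and add salt pepper and butter",
    "the bank raised interest rates on loans",
    "investors sold stocks as the market fell",
]
train_labels = ["sports", "sports", "cooking", "cooking", "finance", "finance"]
test_docs = [
    "the players scored a goal in the match",
    "add flour and butter then bake",
]

N_TOPICS = 3  # "N" in the steps above; an arbitrary choice here

# Steps 1-4: word counts -> one LDA model fitted on all training
# documents -> topic proportions -> multi-class SVM.
clf = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=N_TOPICS, random_state=0),
    SVC(kernel="linear"),
)
clf.fit(train_docs, train_labels)

# Steps 5-6: the same fitted topic model infers the topic proportions
# of each test document, and the SVM classifies them.
predictions = clf.predict(test_docs)

# The intermediate feature vectors (one row per test document,
# one column per topic) can be inspected by dropping the last step.
topic_features = clf[:-1].transform(test_docs)
```

Wrapping everything in one pipeline keeps the topic model from ever seeing test documents during fitting, which is what step 5 asks for.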

Type 2.
1. Take the training set of documents and divide it into separate clusters based on the class labels.
2. Extract, say, N topics for each cluster of documents (each cluster will be represented by its own set of N topics). Also get the topic proportions of each document in each cluster.
3. Represent each document by a vector of topic proportions. Fill the elements that correspond to the other clusters' topics with zeros, so that all document vectors have the same length.
4. Train a multi-class SVM.
5. Pass each test document separately through all of the topic models built above to get its topic proportions.
6. Represent the document by a vector that contains all the topic proportions obtained from all the topic models.
7. Feed it to SVM to get the document class prediction.
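
For comparison, here is a rough sketch of the Type 2 steps as I read them. Again the toy data, N = 2 topics per class, the shared vocabulary, and the linear-kernel SVC are my own assumptions; in particular, fitting a single CountVectorizer on all training documents is a choice the steps do not specify.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy corpus, invented purely for illustration.
train_docs = [
    "the team won the football match after a late goal",
    "the coach praised the players after the game",
    "bake the bread in a hot oven with flour and yeast",
    "stir the sauce and add salt pepper and butter",
    "the bank raised interest rates on loans",
    "investors sold stocks as the market fell",
]
train_labels = np.array(
    ["sports", "sports", "cooking", "cooking", "finance", "finance"]
)
test_docs = [
    "the players scored a goal in the match",
    "add flour and butter then bake",
]

N_TOPICS = 2  # topics per class; an arbitrary choice here

# Shared vocabulary so every per-class model sees the same columns
# (an assumption; the steps do not say how the vocabulary is built).
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)
test_counts = vectorizer.transform(test_docs)

classes = np.unique(train_labels)

# Steps 1-3: one LDA model per class; each training document gets its
# own class's topic proportions and zeros in every other block.
models = {}
X_train = np.zeros((len(train_docs), N_TOPICS * len(classes)))
for k, c in enumerate(classes):
    rows = np.flatnonzero(train_labels == c)
    models[c] = LatentDirichletAllocation(
        n_components=N_TOPICS, random_state=0
    ).fit(train_counts[rows])
    block = slice(k * N_TOPICS, (k + 1) * N_TOPICS)
    X_train[rows, block] = models[c].transform(train_counts[rows])

# Step 4: multi-class SVM on the zero-padded vectors.
svm = SVC(kernel="linear").fit(X_train, train_labels)

# Steps 5-7: each test document goes through every per-class model,
# and the resulting proportions are concatenated before prediction.
X_test = np.hstack([models[c].transform(test_counts) for c in classes])
predictions = svm.predict(X_test)
```

Written out this way, the training vectors are non-zero in only one block, while the test vectors are non-zero in every block, so the two are not built the same way.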

As you can see, the only difference between Type 1 and Type 2 is that Type 2 builds a separate topic model for each class.

Which of these two ways of building the machine learning model is the right way to do document classification using topic models?

Best Answer

The approach termed "Type 1" was already explored in the paper by Blei et al. (2003), http://jmlr.org/papers/volume3/blei03a/blei03a.pdf, in §7.2. The result reported there is that using LDA topic proportions is a valuable form of feature selection for the SVM.

So "Type 1" is definitely one right way.

I have no comments on "Type 2" except that it lacks the clarity that "Type 1" has.
