Solved – Text classification beginner steps

classification, naive bayes, python, svm, text mining

I've created a data set containing the title, abstract, and keywords of scientific articles. I want to train a model that assigns keywords to a new article based on its abstract (for now I'm not including the title).

I've almost finished the pre-processing (removing stop words and single-character words, stemming).

Question 1: What are my next steps? Which classifiers do you suggest, and why?

Question 2: How should I measure accuracy when some keywords suggested by the classifier might seem correct but are not the same as the keywords in the data set?

I'm new to machine learning and text classification problems. I know there are plenty of articles on the subject, and I've read some, but the sheer number of them has left me confused. I need ideas from more experienced users to continue.

Best Answer

  1. Before building a classifier, you should build the features and the target instances (i.e., keyword sets). A machine cannot interpret textual data the way we do, so the raw text has to be converted into a numerical format. My suggestion for this step is to use unsupervised learning methods for text data such as SVD, word2vec, and GloVe. They map words to vectors (i.e., sequences of numbers) in a vector space.

Then, to turn this into a classification problem, create keyword sets. You could include all the keywords or only the important ones. Then encode them as vectors. These vectors are the target instances that you want to predict. Each document has one target instance (i.e., one vector). For instance, if there are 5 keywords $(k_1, k_2, k_3, k_4, k_5)$ and a certain document has the two keywords $(k_2, k_4)$, the target instance of that document could be created as below. $$ [0, 1, 0, 1, 0] $$
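The encoding above can be sketched in a few lines of plain Python. The function name `encode_keywords` and the keyword labels `k1` to `k5` are placeholders chosen for illustration, not part of any library.

```python
def encode_keywords(doc_keywords, vocabulary):
    """Return a 0/1 vector with one position per keyword in `vocabulary`."""
    present = set(doc_keywords)
    return [1 if kw in present else 0 for kw in vocabulary]


# Hypothetical keyword vocabulary, in a fixed order.
vocabulary = ["k1", "k2", "k3", "k4", "k5"]

# A document tagged with k2 and k4.
target = encode_keywords(["k2", "k4"], vocabulary)
print(target)  # [0, 1, 0, 1, 0]
```

If you use scikit-learn, `sklearn.preprocessing.MultiLabelBinarizer` produces the same kind of encoding and also learns the vocabulary from the data.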

Choosing the classifier is a relatively minor issue at this step. Start with baseline classifiers (SVM, Naive Bayes, ...), and if the results are not satisfactory, try other classifiers such as ensembles or neural networks.
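As a rough sketch of such a baseline, assuming scikit-learn is available: the toy abstracts and keywords below are invented for illustration, TF-IDF is swapped in as a simple feature representation (it is not one of the embedding methods named above), and one linear SVM is trained per keyword.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Invented toy data: pre-processed abstracts and their keyword sets.
abstracts = [
    "neural network image recognition",
    "support vector machine text classification",
    "deep neural network text classification",
    "image segmentation with convolution",
]
keywords = [
    ["neural networks", "vision"],
    ["svm", "text mining"],
    ["neural networks", "text mining"],
    ["vision"],
]

# Encode each keyword set as a 0/1 target vector.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(keywords)

# TF-IDF features, then one binary linear SVM per keyword.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(abstracts, Y)

# Predict a 0/1 vector for a new abstract and map it back to keywords.
predicted = model.predict(["neural network for text"])
print(binarizer.inverse_transform(predicted))
```

Swapping `LinearSVC()` for `MultinomialNB()` from `sklearn.naive_bayes` gives a Naive Bayes baseline with the same pipeline.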

  2. This part is up to you. You can compare the predicted target ($\hat{y}$) with the actual target ($y$) and define any (logical) accuracy measure. For instance, in the example above, you could count a prediction as 'correct' if the predicted $y$ and the actual $y$ are identical, and 'incorrect' if not. The overall accuracy is then the number of 'correct' predictions relative to the total.

Or, if you want something more fine-grained, you could build a level of 'correctness' into your accuracy measure. For instance, you could use the Euclidean distance between the predicted $y$ and the actual $y$ as the error. This penalizes both false positives and false negatives (type 1 and type 2 errors), while rewarding the keywords that match between the predicted $y$ and the actual $y$.
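Both measures can be sketched in plain Python. The function names and the example vectors are illustrative assumptions. Note that for 0/1 vectors the squared Euclidean distance is simply the number of positions where the prediction and the truth disagree.

```python
import math


def exact_match(y_true, y_pred):
    """True only if every keyword position agrees."""
    return list(y_true) == list(y_pred)


def euclidean_distance(y_true, y_pred):
    """Distance between two equal-length 0/1 vectors; 0 means a perfect match."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def overall_accuracy(true_vectors, predicted_vectors):
    """Fraction of documents whose predicted vector matches exactly."""
    pairs = list(zip(true_vectors, predicted_vectors))
    return sum(exact_match(t, p) for t, p in pairs) / len(pairs)


y_true = [0, 1, 0, 1, 0]  # document actually has k2 and k4
y_pred = [0, 1, 1, 0, 0]  # model predicted k2 and k3

print(exact_match(y_true, y_pred))         # False
print(euclidean_distance(y_true, y_pred))  # sqrt(2): one extra keyword, one missed
print(overall_accuracy([y_true, y_true], [y_true, y_pred]))  # 0.5
```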