I've created a data set containing the title, abstract, and keywords of scientific articles. I want to train a model that classifies a new article and assigns keywords to it based on its abstract (for now I'm not including the title).
I've almost finished the pre-processing (removing stop words and single-character words, stemming).
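For reference, the pipeline is roughly this minimal sketch (assuming NLTK's English stop word list and Porter stemmer; my actual code may differ in the details):

```python
import re
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase and keep purely alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words and single-character words, then stem the rest.
    return [stemmer.stem(t) for t in tokens
            if t not in stop_words and len(t) > 1]
```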
Question 1: What are my next steps? Which classifiers do you suggest, and why?
Question 2: How should I measure accuracy when some of the keywords a classifier suggests might seem correct but are not the same as the keywords in the data set?
I'm new to machine learning and text classification problems. I know there are plenty of articles out there, and I've read some, but their sheer number left me confused. I need ideas from more experienced users to continue.
Best Answer
First, turn each pre-processed abstract into a feature vector (e.g., a bag-of-words or TF-IDF representation). Then, to turn this into a classification problem, build keyword sets. You could include all the keywords or only the important ones. Encode each document's keyword set as a binary vector; these vectors are the target instances you want to predict, one target instance (i.e., one vector) per document. For instance, if there are 5 keywords $(k_1, k_2, k_3, k_4, k_5)$ and a certain document has two of them $(k_2, k_4)$, the target instance of that document would be $$ [0, 1, 0, 1, 0] $$
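A minimal sketch of this encoding, assuming scikit-learn's `MultiLabelBinarizer` (the keyword names are just the placeholders from the example):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Full keyword vocabulary and each document's keyword set.
all_keywords = ["k1", "k2", "k3", "k4", "k5"]
doc_keywords = [["k2", "k4"],   # document 1
                ["k1", "k5"]]   # document 2

mlb = MultiLabelBinarizer(classes=all_keywords)
Y = mlb.fit_transform(doc_keywords)
print(Y[0])  # -> [0 1 0 1 0], matching the example above
```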
Choosing the classifier is a relatively minor issue at this step. Start with baseline classifiers (SVM, Naive Bayes, ...) and, if you cannot get satisfying results, try other classifiers such as ensembles or neural networks.
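As a sketch of such a baseline, assuming scikit-learn with TF-IDF features and one linear SVM per keyword (`abstracts` and `new_abstracts` are placeholders for your own data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

clf = make_pipeline(
    TfidfVectorizer(),                 # abstracts -> feature vectors
    OneVsRestClassifier(LinearSVC()),  # one binary SVM per keyword
)
clf.fit(abstracts, Y)                  # Y: the binary keyword matrix above
Y_pred = clf.predict(new_abstracts)    # one 0/1 keyword vector per article
```

`OneVsRestClassifier` handles the multi-label case by fitting one independent binary classifier per keyword column of `Y`.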
For Question 2, the simplest measure is exact-match accuracy: a prediction counts as correct only if the entire predicted keyword vector equals the actual one. Or, if you want a more fine-grained measure, you could define a level of 'correctness'. For instance, you could use the Euclidean distance between the predicted $y$ and the actual $y$ as the error: every false positive (type I error) and false negative (type II error) increases the distance, while correct keyword matches contribute nothing to it.
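A minimal sketch of that error computation on the binary vectors from above:

```python
import numpy as np

def keyword_error(y_true, y_pred):
    # Euclidean distance between binary keyword vectors: each false
    # positive or false negative adds 1 under the square root,
    # while correct matches add nothing.
    return np.linalg.norm(np.asarray(y_true) - np.asarray(y_pred))

# Actual keywords [k2, k4]; prediction got k2 right, missed k4, added k3.
print(keyword_error([0, 1, 0, 1, 0], [0, 1, 1, 0, 0]))  # sqrt(2) ≈ 1.414
```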