Solved – Supervised keyword extraction: results

Tags: precision-recall, text mining

I've treated keyword extraction as a classification problem (1 = author-generated keyword, 0 = not a keyword) and tried to extract keywords from text documents automatically, using several kinds of models. These are the results (the columns represent the human-generated keywords, the rows the automatic extraction):

Baseline method:
        1    0
    1   62   63
    0   52
Naive Bayes:
        1    0
    1   29   9  
    0   85 
SVM:
        1    0
    1   29   9  
    0   85 
Decision tree:
        1    0
    1   29   18  
    0   85 



                    Average keywords extracted   Precision (%)   Recall (%)
Baseline method     5                            49.6            54.4
Naive Bayes         1.52                         76.3            25.4
SVM                 1.52                         76.3            25.4
Decision tree       1.88                         61.7            25.4

Now I have to choose the 'best' model. Which model would you choose, and why?

I would prefer the baseline method: on average it matches roughly one of the two keywords chosen by the author in this collection, so its recall is the highest of all the models. Is this a good idea?
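
One common way to weigh this precision/recall trade-off in a single number is the F1 score, the harmonic mean of the two. A quick Python sketch computing it from the figures in the table above (this calculation is my addition, not part of the original results):

# F1 (harmonic mean of precision and recall) for each model,
# using the percentages reported in the results table.
results = {
    "Baseline method": (49.6, 54.4),
    "Naive Bayes":     (76.3, 25.4),
    "SVM":             (76.3, 25.4),
    "Decision tree":   (61.7, 25.4),
}

for name, (precision, recall) in results.items():
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:<16} F1 = {f1:.1f}")

On these numbers the baseline comes out highest on F1 (about 51.9, versus roughly 38.1 for Naive Bayes/SVM and 36.0 for the decision tree), which is one argument in its favour if precision and recall are weighted equally.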

Best Answer

It sounds like you are actually trying to perform multi-label prediction (where the labels are your author-supplied keywords) on documents by extracting text features (your model's "keywords"). However, human labels are rarely exact matches for terms extracted from text documents; in many cases several terms together give rise to something a human would label with a keyword. For example, a document might get the label "emergency" from the author, yet that word may never appear in the document (or may be very rare among many "emergency" documents). Instead, the document may contain several terms like "ambulance", "accident", "firefighter", "911", etc., all of which indicate an emergency situation.

You should be extracting terms from your documents using various stemming and vectorization methods, and then feeding those in some form to your algorithms to predict the labels ("keywords"). It isn't clear what you are actually doing, but you need to create a data set such as:


AUTHOR KW | DOCUMENT TEXT  
emergency | There was an accident on I5 this morning.  Ambulances responded, along with firefighters.  The accident was fatal and resulted in a long delay in traffic.
...

From that raw data, you extract terms into a term-frequency vector (or weight it by the inverse document frequency to get TF-IDF). For example, from the text above you might get the terms:


accident, I5, morning, ambulance, respond, along, firefighters, fatal, result, long, delay, traffic
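
As a rough sketch of what that extraction step could look like, assuming Python with scikit-learn and NLTK (my choice of tools here, not something stated in the question):

import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfVectorizer,
)

stemmer = SnowballStemmer("english")

def stem_tokens(doc):
    # Lower-case, keep alphanumeric tokens, drop stop words, then stem.
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

docs = [
    "There was an accident on I5 this morning. Ambulances responded, "
    "along with firefighters. The accident was fatal and resulted in a "
    "long delay in traffic.",
]

# Raw term counts (one row per document, one column per stemmed term)...
count_vec = CountVectorizer(tokenizer=stem_tokens, token_pattern=None, lowercase=False)
X_counts = count_vec.fit_transform(docs)

# ...or the same terms weighted by inverse document frequency (TF-IDF).
tfidf_vec = TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None, lowercase=False)
X_tfidf = tfidf_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())

The particular stemmer and stop-word list are just one reasonable choice; any equivalent preprocessing would do.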

Then the input to your Naive Bayes, SVM, and decision tree (you might want to try random forests, too), using a simple count vector, would be:


LABEL | TERM 1 | TERM 2 | TERM 3 | TERM 4 | TERM 5 | TERM 6 | TERM 7 | TERM 8 | TERM 9 | TERM 10 | TERM 11 | TERM 12 
1     | 2      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1

Of course, with many documents you end up with many, many more terms, and most documents have zero counts for most of the terms that appear in other documents, making the input data a sparse matrix. Depending on what language you are using, storing it as a sparse matrix can speed things up considerably.
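
To make that last step concrete, here is a hypothetical continuation in the same Python/scikit-learn setting, with synthetic sparse data standing in for the real document-term matrix and labels (none of this comes from the question):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the sparse document-term matrix from the vectorizer:
# 200 "documents" x 500 "terms", mostly zeros, with binary keyword labels.
rng = np.random.default_rng(0)
X = csr_matrix(rng.poisson(0.05, size=(200, 500)).astype(float))
y = rng.integers(0, 2, size=200)

models = {
    "Naive Bayes":   MultinomialNB(),
    "SVM":           LinearSVC(),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(n_estimators=200),
}

for name, model in models.items():
    # Cross-validated F1 gives a fairer comparison than a single split.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name:<14} mean F1 = {scores.mean():.2f}")

In practice you would evaluate with the same precision/recall (or F1) figures as in your table, computed on the same splits, and compare each model against the baseline.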

This is basic document classification, so searching for that should reveal much to you. If you want to provide more details about what language and packages you are using, and some sample data, we can be more specific on how to proceed.