There are several ways to evaluate keywords ...
Standalone (evaluating only one generator at a time)
According to Wikipedia (where "index term" is a synonym for keyword in information retrieval), a keyword is
a "term that captures the essence of the topic of a document",
which can either mean a term that summarizes the document but does not appear in it (hard for machines), or a term that appears often in the document (easy for machines), possibly in variations, but not too often (otherwise it would be a common word like "and"). A commonly used measure here is the TF-IDF score.
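As a rough sketch (with made-up toy documents), TF-IDF weighs how often a term occurs within a document against how many documents in the corpus contain it at all:

```python
import math

# Toy corpus (invented documents, purely for illustration)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs and the statistics",
]

def tf_idf(term, doc, corpus):
    # term frequency: how often the term occurs in this document
    tf = doc.split().count(term) / len(doc.split())
    # document frequency: in how many documents the term occurs at all
    df = sum(1 for d in corpus if term in d.split())
    # inverse document frequency: rare terms get a higher weight
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # moderately high: occurs in only one document
print(tf_idf("the", docs[0], docs))  # 0.0: occurs in every document, so it carries no weight
```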
But what do "often" and "too often" mean? That is unclear ... it is in the eye of the beholder ... and exactly the reason why this kind of standalone validation is not possible.
Comparing the output of two keyword generators
... for the same set of documents. Assuming that you trust one of the generators and hence use it as a reference, you can calculate the overlap of their outputs using, e.g., the Jaccard index.
As a result, the keywords of your generator are only as valid as the ones from the reference generator, but not necessarily valid or useful per se.
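A minimal sketch of that comparison, assuming each generator simply returns a set of keyword strings per document:

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 1.0  # two empty keyword sets are trivially identical
    return len(a & b) / len(a | b)

# Invented outputs of two keyword generators for the same document
reference = {"keyword", "extraction", "tf-idf", "document"}
candidate = {"keyword", "extraction", "corpus"}

print(jaccard(reference, candidate))  # 0.4 -> moderate overlap with the reference generator
```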
Evaluating the keyword relevance for an application
... and to illustrate why standalone validation is not possible.
Suppose you have two documents, each containing the following words (among other, useless ones):
- document A: love, feeling
- document B: hate, feeling
and 100,000 more documents, all about statistics, in which none of these words appears.
Now you have to pick one keyword, and only one. Which one is the best?
It depends ...
- If you want to cluster the documents according to their topic, you have to use feeling.
- If you want to create a sentiment classifier that labels all documents as positive, negative, or neutral, you have to use love and hate, because otherwise you cannot distinguish the two documents.
In summary, one can easily evaluate whether a set of keywords is useful for a given application, be it a sentiment classifier, a spam detector, or a search engine. But a keyword that is useful for one application is not necessarily useful for another.
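For instance, a minimal sketch of such an application-driven evaluation with scikit-learn (the corpus, labels, and keyword sets below are invented): restrict the classifier's vocabulary to the candidate keywords and measure how well the downstream task works.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Invented toy corpus with sentiment labels
texts = [
    "i love this feeling", "pure love and joy", "i hate this feeling",
    "hate and anger", "love it", "hate it",
] * 10  # repeated so that cross-validation has enough samples
labels = ["pos", "pos", "neg", "neg", "pos", "neg"] * 10

def score_keywords(keywords):
    # Only the candidate keywords are allowed as features
    vec = CountVectorizer(vocabulary=sorted(keywords))
    X = vec.transform(texts)
    return cross_val_score(MultinomialNB(), X, labels, cv=5).mean()

print(score_keywords({"feeling"}))       # useless for sentiment: accuracy around 0.5
print(score_keywords({"love", "hate"}))  # separates the classes well: accuracy near 1.0
```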
Update
It seems to be a rule of the internet: everything you can think of is probably already a research discipline. In this case it is called Terminology Extraction.
I think the issue here may be the use of term vectors. Your instances (bags of words) are translated into vectors of perhaps 150 to 10,000 dimensions. Each word that occurs in your corpus (the websites) is one dimension, and the value for each instance is the frequency (or the TF-IDF score) of that word in the given website.
In a space with that many dimensions, most machine learning algorithms will suffer. You've chosen fairly lightweight algorithms, but they may still take a while to converge, depending on how they're implemented.
The most common classifier in this scenario is naive Bayes, which doesn't see the instance space as a high-dimensional space, but just as a collection of frequencies (from which it estimates, using Bayes' theorem, the probability of each class). Training this classifier should take as long as it takes to read the data once, and classification should take as long as it takes to read the instance. Since it has essentially no parameters to tune, it will at least give you a good baseline. NLTK almost certainly has this algorithm (it's the mother of spam detection).
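For instance, a minimal sketch with NLTK's built-in Naive Bayes classifier (the toy training data is invented, and real feature extraction would of course cover your whole vocabulary):

```python
import nltk

def features(text):
    # Bag-of-words features: word -> True if the word occurs in the text
    return {word: True for word in text.lower().split()}

# Invented toy training data: (feature dict, label) pairs
train = [
    (features("cheap pills buy now"), "spam"),
    (features("meeting agenda for monday"), "ham"),
    (features("win money fast buy now"), "spam"),
    (features("lunch on monday?"), "ham"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("buy cheap pills")))  # most likely "spam"
```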
Another option, if you want to use more traditional ML algorithms, is to reduce the dimensionality of the dataset to something manageable (anything below 50 dimensions) using PCA. This will take more time and make it more difficult to update your classifier, but it can lead to good performance.
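A sketch of that reduction with scikit-learn (the random matrix below stands in for your real term vectors; for a large sparse term matrix, TruncatedSVD is the usual PCA-like alternative, since plain PCA needs dense input):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in term-frequency matrix: 200 documents x 5000 terms (random data for illustration)
X = np.random.rand(200, 5000)

pca = PCA(n_components=50)        # keep the 50 strongest directions of variance
X_reduced = pca.fit_transform(X)  # shape (200, 50), manageable for SVMs or decision trees

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # how much variance the 50 components retain
```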
Best Answer
It sounds like you are actually trying to perform multi-label prediction (where the labels are your author-supplied keywords) on documents by extracting text features (your model's "keywords"). However, human labels are rarely exact matches for terms extracted from text documents, and in many cases it is a combination of terms that a human would summarize with a single keyword. For example, a document might get the label "emergency" from the author, but that word may never appear in the document (or be very rare among many "emergency" documents). Instead, the document may have several terms like "ambulance", "accident", "firefighter", "911", etc., all of which indicate an emergency situation.
You should be extracting terms from your documents using various stemming and vectorization methods, and then feeding those in some form to your algorithms to predict the labels ("keywords"). It isn't clear what you are actually doing, but you need to start from a data set that pairs each document's raw text with its author-supplied keywords.
From that raw text, you extract terms into a term frequency vector (or weight it by the inverse document frequency to get TF-IDF), so that each document is represented by the terms it contains.
Your data input to Naive Bayes, SVM, and Decision Tree (you might want to try Random Forests, too) is then a simple count vector per document.
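A small sketch of what that input looks like with scikit-learn (the documents, terms, and labels below are all invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer  # or TfidfVectorizer for TF-IDF weights
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Invented raw data: document text paired with an author-supplied keyword (label)
docs = [
    "ambulance rushed to the accident, firefighters called 911",
    "the senate passed the budget bill after a long debate",
    "firefighter crews respond to highway accident",
    "parliament votes on the new tax bill",
]
labels = ["emergency", "politics", "emergency", "politics"]

vectorizer = CountVectorizer()             # one column per term, value = count in the document
X = vectorizer.fit_transform(docs)         # sparse matrix of shape (n_docs, n_terms)
print(vectorizer.get_feature_names_out())  # the extracted terms
print(X.toarray())                         # the count vectors, one row per document

for model in (MultinomialNB(), LinearSVC(), DecisionTreeClassifier()):
    model.fit(X, labels)
    print(type(model).__name__,
          model.predict(vectorizer.transform(["accident on the highway, send an ambulance"])))
```

In a true multi-label setting you would use one binary indicator column per keyword rather than a single label column, but the vectorization of the documents stays the same.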
Of course, with many documents you end up with many, many more terms, and most documents have zero counts for most of the terms that appear in other documents, making the input data a sparse matrix. Depending on what language you are using, exploiting this sparsity can speed things up considerably.
This is basic document classification, so searching for that term should reveal much to you. If you want to provide more details about the language and packages you are using, and some sample data, we can be more specific about how to proceed.