Solved – Automatic keyword extraction: using cosine similarities as features

Tags: cosine distance, cosine similarity, feature-engineering, supervised learning, text mining

I've got a document-term matrix $M$, and now I would like to extract keywords for each document with a supervised learning method (SVM, Naive Bayes, …). In this model, I already use tf-idf, POS tags, …

But now I'm wondering about the next step. I've got a matrix $C$ with the cosine similarities between the terms.

Is it possible to use these similarities as a feature for my model? My idea was, for term $i$ in document $d$, to use the average of the cosine similarities between term $i$ and all the other terms in document $d$. Would this be useful?
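For concreteness, here is a minimal sketch of what I mean (the helper name and the dense-matrix assumption are only illustrative):

import numpy as np

# illustrative sketch: M is the document-term matrix as a dense array,
# C the term-term cosine similarity matrix
def avg_term_similarity(M, C, d, i):
    terms_in_d = np.flatnonzero(M[d])        # terms occurring in document d
    others = terms_in_d[terms_in_d != i]     # leave out term i itself
    return C[i, others].mean() if others.size else 0.0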

Best Answer

I don't know how it's possible to do keyword extraction with supervised learning, but I do know how to do it with unsupervised learning.

There are several methods of doing this, so here they are:

Hierarchical

You can apply any hierarchical clustering method to the term-similarity matrix directly (with any similarity function, not just cosine).

In scikit-learn you'd do something like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

# `data` is the list of raw documents and `k` the desired number of clusters
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)          # document-term tf-idf matrix
C = 1 - cosine_similarity(X.T)              # term-term cosine distances
# Ward linkage treats each row of C (a term's distance profile) as a feature vector
ward = AgglomerativeClustering(n_clusters=k, linkage='ward').fit(C)
label = ward.labels_

Source: [1]

But since it's agglomerative clustering, it's computationally expensive and will take a while on a large vocabulary.

K-Means

Another possibility is to run the usual k-means on the document vectors (the rows of the document-term matrix) and then find the top terms for each centroid.

For example, this is how you'd do it in scikit-learn:

from sklearn.cluster import KMeans

# cluster the documents (rows of X)
km = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)

# for each centroid, sort the terms by their weight (descending)
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
for i in range(k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Source: [2]

But k-means relies on Euclidean distance, which works poorly for sparse, high-dimensional data. There are other techniques that work better for text and use cosine similarity.

Cosine K-Means and Scatter/Gather

It's possible to use Cosine with K-means (see e.g. [3]): calculate centroids as a mean over all documents in each cluster, and then use cosine to calculate the distance to the closest centroid.

At the end, you can extract keywords the same way as for usual k-means.
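A minimal sketch of this, reusing X, k and terms from the snippets above: L2-normalize the tf-idf rows and run the standard KMeans on the result. For unit-length vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so the assignment step effectively uses cosine; the centroid update is still a plain (un-normalized) mean, so this only approximates true spherical k-means.

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# on unit vectors, minimizing Euclidean distance = maximizing cosine similarity
X_norm = normalize(X)
km_cos = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_norm)

# keywords: the most heavily weighted terms of each centroid
order = km_cos.cluster_centers_.argsort()[:, ::-1]
for i in range(k):
    print("Cluster %d:" % i, " ".join(terms[ind] for ind in order[i, :10]))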

Calculating the centroid as the mean of all documents in the cluster is not always good. Another approach is suggested in the Scatter/Gather algorithm [4]: the centroid of a cluster is the concatenation of all the documents in that cluster.

With this approach you just need to take the most frequent terms for each cluster centroid.
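A sketch of that extraction step, reusing `data`, `k` and the `km.labels_` from the k-means snippet above (the count vectorizer is my addition): "concatenating" the documents of a cluster amounts to summing their raw term-count vectors.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# raw term counts for the same documents
count_vec = CountVectorizer(stop_words='english')
counts = count_vec.fit_transform(data)
count_terms = count_vec.get_feature_names()

for i in range(k):
    rows = np.flatnonzero(km.labels_ == i)
    # sum of the count vectors = "concatenated" centroid of the cluster
    centroid = np.asarray(counts[rows].sum(axis=0)).ravel()
    top = centroid.argsort()[::-1][:10]
    print("Cluster %d:" % i, " ".join(count_terms[j] for j in top))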

There's no implementation of these algorithms in scikit-learn, but you can easily implement them yourself by extending KMeans.

Note that in both cases the centroids become quite dense (denser than the rest of the documents in each cluster), so you may want to truncate the terms in the centroids, i.e. remove the "unimportant" ones (see [8]).

Spectral Clustering

Another way would be to apply spectral clustering. You'll need to supply a similarity matrix, which you already have, and it will find clusters on it.

It's implemented in the SpectralClustering class, see the examples in [5]. Note that since you already have a pre-computed similarity matrix, you need to pass affinity='precomputed' when initializing it.
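A sketch, reusing X and terms from above; here the terms themselves (the columns of X) are clustered, which can be memory-hungry for a large vocabulary:

from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

# term-term cosine similarities; tf-idf vectors are non-negative,
# so this is a valid (non-negative) affinity matrix
S = cosine_similarity(X.T)

sc = SpectralClustering(n_clusters=k, affinity='precomputed', random_state=1)
term_labels = sc.fit_predict(S)

for i in range(k):
    cluster_terms = [t for t, lab in zip(terms, term_labels) if lab == i]
    print("Cluster %d:" % i, " ".join(cluster_terms[:10]))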

Spectral clustering is related to Kernel KMeans: there is a paper (see [7]) that shows they are the same thing. I recently came across an implementation of Kernel KMeans that may be useful: https://gist.github.com/mblondel/6230787

Non-Negative Matrix Factorization

Finally, you can cluster your term-document matrix with some decomposition techniques from linear algebra, like SVD (this would be the so-called "Latent Semantic Analysis") or Non-Negative Matrix Factorization. The latter can be seen as clustering, and it can cluster both rows and columns of the matrix at the same time.

For example, you can extract keywords by doing

from sklearn.decomposition import NMF

# factorize the tf-idf matrix into k "topics"
nmf = NMF(n_components=k, random_state=1).fit(X)

feature_names = vectorizer.get_feature_names()

# each row of components_ holds the term weights of one topic;
# print the 10 most heavily weighted terms per topic
for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i]
                    for i in topic.argsort()[:-10-1:-1]]))
    print()

Code source: [6]

Even though the examples here are in Python/scikit-learn, it shouldn't be a big problem to find similar examples for R.

Sources