Solved – Keyword clustering

clustering, machine learning, text mining

I have one million keywords (from Google search queries), and I need to group them semantically. I have already done some research and found information about how to extract keywords from a large corpus and cluster them, but in my case I don't have any large documents, only the keywords themselves.

I imagine that clustering these keywords semantically is impossible (although I hope I am wrong), since I guess you need a large text to extract meaning, and each of my keywords has at most 4 or 5 words. I thought about crawling the web to build a large corpus, then applying some of the techniques I have seen, such as TF-IDF followed by k-means, to cluster those documents and extract their keywords and subjects. I could then compare my own keywords to the extracted ones and cluster them accordingly... but I don't know if this would work.

Could anyone tell me if my approach is correct? If so, once I have clustered the keywords from the documents I get from the web, what kinds of techniques would I need to cluster my own keywords?

Best Answer

I figure the best you can do is to try to map the keywords to WordNet and use the WordNet hierarchy as the "clustering". You can also try to discover frequent itemsets, i.e. sets of words that often occur together in the same keyword.
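To make the frequent-itemset idea concrete, here is a minimal sketch in plain Python. It treats each keyword as a small set of words and counts every word set up to a given size by brute force, which stays cheap because each keyword has only a handful of words; it is not Apriori or FP-growth. The function names, the sample keywords and the `min_support` threshold are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(keywords, min_support=2, max_size=2):
    """Return word sets (as sorted tuples) found in at least `min_support` keywords."""
    counts = Counter()
    for kw in keywords:
        # Each keyword is a "transaction"; duplicate words count once.
        words = sorted(set(kw.lower().split()))
        for size in range(1, max_size + 1):
            for itemset in combinations(words, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def group_by_itemset(keywords, itemsets):
    """Map each frequent itemset to the keywords that contain all of its words."""
    groups = {s: [] for s in itemsets}
    for kw in keywords:
        words = set(kw.lower().split())
        for s in itemsets:
            if words.issuperset(s):
                groups[s].append(kw)
    return groups


# Hypothetical sample data, standing in for the real search queries.
keywords = [
    "cheap flights london",
    "cheap flights paris",
    "london hotels",
    "paris hotels cheap",
    "flights to london",
]

itemsets = frequent_itemsets(keywords, min_support=2)
groups = group_by_itemset(keywords, itemsets)

for itemset, members in sorted(groups.items()):
    print(itemset, "->", members)
```

Note that a keyword can land in several groups, so this yields overlapping groups rather than a partition, and it groups only by shared words: "inexpensive flights" and "cheap flights" would still be unrelated.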

You want to cluster them by their meaning, I guess. How is an algorithm that only has the characters available supposed to discover meaning here? IMHO you will need to give any algorithm more data than just the words to produce a sensible result.