Solved – K-means clustering feature selection

feature selection, k-means, machine learning

I have a set of English and foreign-language documents that I would like to perform k-means clustering on to find document groups by topic. These documents are concatenated social media comments for individual users and may run to 1000-5000 words. (For example, a document may contain every question user 'Paul' has asked on StackExchange.)

I have performed feature extraction in two ways:

(1) extracting named entities from a known list, which proved to have too poor coverage of the document set

(2) raw tokenisation, which proved too noisy given my poor-quality feature selection

My usual strategy for feature selection in supervised learning is to calculate TF-IDF scores and systematically try different thresholds, keeping whichever gives the best cross-validation performance. (Clearly that is impossible here, since there are no labels to validate against.)

I have TF-IDF scores and I've tried using intuition to pick a good threshold, but I'm struggling to evaluate whether the resulting clustering is good or bad. Google has turned up some papers, but nothing as prescriptive as I would like.
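To make that concrete, here is a rough sketch of the same threshold search adapted to the unsupervised setting, with cross-validation swapped for an internal criterion (the silhouette coefficient), which needs no labels. The scikit-learn calls are real; the threshold candidates, the range of k and the document loader are placeholders.

```python
# Sketch only: sweep TF-IDF thresholds and k, scoring each clustering with the
# silhouette coefficient (no labels needed). The min_df/max_df candidates, the
# range of k, and load_user_documents() are placeholders.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

documents = load_user_documents()  # hypothetical: one concatenated string per user

best = None
for min_df, max_df in [(2, 0.5), (5, 0.5), (5, 0.8)]:   # candidate thresholds
    X = TfidfVectorizer(min_df=min_df, max_df=max_df).fit_transform(documents)
    for k in range(2, 11):                               # candidate cluster counts
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)              # in [-1, 1], higher is better
        if best is None or score > best[0]:
            best = (score, min_df, max_df, k)

print("best silhouette %.3f at min_df=%s, max_df=%s, k=%d" % best)
```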

Any suggestions on getting started with feature selection for k-means or other unsupervised clustering?

Many thanks.

Best Answer

Dangers of TF-IDF on different languages

First of all, as stressed by @Roosh, applying TF-IDF to a corpus that mixes documents in different languages seems like a bad idea to me.

Imagine the case where two languages do not share the same alphabet (like Japanese and English). In the TF-IDF embedding, their documents will live on disjoint sets of axes. The same thing happens for languages that do not share (m)any words (like French and German).

So my first impression is that unsupervised clustering on documents in different languages will create clusters mostly related to the languages themselves.
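Here is a toy illustration of that point (English and French, chosen only because they share no tokens here): the cross-language TF-IDF vectors come out orthogonal, so any distance-based clustering will separate the languages before the topics.

```python
# Toy illustration: the English and French documents share no tokens, so their
# TF-IDF vectors occupy disjoint axes and their cross-language cosine
# similarity is zero; k-means would split them by language before topic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",      # English, about cats
    "the dog sat on the rug",      # English, about dogs
    "le chat dort sur le tapis",   # French, about cats
    "le chien dort sur le tapis",  # French, about dogs
]
X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X).round(2))  # English-French entries come out as 0.0
```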

Another approach

Therefore you may want to use a different embedding than TF-IDF. Word2Vec (described in "Distributed Representations of Words and Phrases and their Compositionality") has been applied to two languages simultaneously: see "Joint word2vec Networks for Bilingual Semantic Representations".
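As a very rough sketch of that direction (a plain monolingual gensim Word2Vec plus vector averaging, not the bilingual joint model from the cited paper): train word vectors on the tokenised corpus, represent each document by the mean of its word vectors, and cluster those dense vectors with k-means. The vector size, minimum count, number of clusters and document loader below are placeholder choices.

```python
# Rough sketch of the word-vector route (monolingual gensim Word2Vec, not the
# bilingual joint model from the cited paper). vector_size, min_count, the
# number of clusters and load_user_documents() are placeholders.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

documents = load_user_documents()                        # hypothetical: one string per user
tokenised = [doc.lower().split() for doc in documents]   # crude whitespace tokenisation

w2v = Word2Vec(sentences=tokenised, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(tokens, model):
    """Average of the word vectors for the tokens the model has seen."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

doc_matrix = np.vstack([doc_vector(toks, w2v) for toks in tokenised])
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(doc_matrix)
```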