Solved – How to apply word2vec for k-means clustering

k-means, spark-mllib, text mining, word2vec

Background: I am new to word2vec. Using this method, I am trying to form clusters based on words extracted by word2vec from the abstracts of scientific publications. To this end, I first retrieved sentences from the abstracts via Stanford NLP and put each sentence on its own line in a text file. This produced the line-per-sentence text file that deeplearning4j's word2vec requires (http://deeplearning4j.org/word2vec).
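
For reference, a minimal sketch of that sentence-extraction step, assuming Stanford CoreNLP's tokenize/ssplit annotators (abstractText is a placeholder for one abstract's text; my actual extraction code may differ):

import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Build a pipeline that only tokenizes and splits sentences
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit")
val pipeline = new StanfordCoreNLP(props)

// abstractText holds the text of one abstract (placeholder)
val doc = new Annotation(abstractText)
pipeline.annotate(doc)

// Print one sentence per line, as required by the word2vec input file
doc.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala
  .map(_.get(classOf[CoreAnnotations.TextAnnotation]))
  .foreach(println)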

Since the texts come from scientific fields, they contain many mathematical terms and brackets. See the sample sentences below:

The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) . 

90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months . 

After preparing the text file, I ran word2vec as below:

import java.io.File;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentencePreProcessor;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

// One sentence per line in the input file, lowercased before tokenization
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
    @Override
    public String preProcess(String sentence) {
        return sentence.toLowerCase();
    }
});

// Split on whitespace in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());

log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .iterations(1)
        .layerSize(100)
        .seed(42)
        .windowSize(5)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();

log.info("Fitting Word2Vec model....");
vec.fit();

log.info("Writing word vectors to text file....");

// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");

This script creates a text file in which each row contains a word followed by its vector values, as below:

pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...

As a subsequent step, this text file has been used to form clusters via k-means in Spark. See the code below:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val rawData = sc.textFile("...abs_terms.txt")

// Token 0 of each line is the word itself; tokens 1..100 are its vector values
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(1, 101).map(_.toDouble))).cache()

val numberOfClusters = 10
val numberOfIterations = 100

// Use the KMeans implementation provided by MLlib
val model = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfIterations)

model.clusterCenters.foreach(println)

// Get the cluster index for each word
val wordsByCluster = rawData.map { row =>
  val tokens = row.split(' ')
  (model.predict(Vectors.dense(tokens.slice(1, 101).map(_.toDouble))), tokens.head)
}

wordsByCluster.foreach(println)
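
For inspection, the (cluster, word) pairs can also be grouped by cluster id so that each cluster's words are listed together; a small sketch using the wordsByCluster RDD from above:

// Print up to 20 words from each cluster to eyeball common themes
wordsByCluster.groupByKey().collect().foreach { case (clusterId, words) =>
  println(s"Cluster $clusterId: ${words.take(20).mkString(", ")}")
}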

Questions: As a result of the Scala code above, I retrieved 10 clusters based on the word vectors produced by word2vec. However, when I inspected the clusters, no obvious common words appeared; that is, I could not get the reasonable clusters I expected. This bottleneck leads me to a few questions:

1. In some word2vec tutorials I have seen that no data cleaning is done; in other words, prepositions etc. are left in the text. So how should I apply a cleaning procedure when using word2vec?

2. How can I visualize the clustering results in an explanatory way?

3. Can I use word2vec word vectors as input to neural networks? If so, which neural network architecture (convolutional, recursive, recurrent) would be most suitable for my goal?

4. Is word2vec meaningful for my goal?

Best Answer

In my experience, the more data you give word2vec, the better it performs. If none of your clusters make sense, I would throw more data at it. I did something similar to what you are doing: trained on 20k sentences and the clusters were bad, but when I trained on 2 million sentences the clusters were very good. Now, how to cluster millions of vectors is its own problem, but you can read up on big-data clustering algorithms, experiment with different packages, and try different ideas that you come up with.
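
If you want a number to track while you experiment, one option is the within-set sum of squared errors for several values of k; a sketch, reusing the extractedFeatureVector RDD from the question and assuming your Spark version has KMeansModel.computeCost:

import org.apache.spark.mllib.clustering.KMeans

// WSSSE for a few values of k; look for an "elbow" where the cost stops dropping sharply
for (k <- Seq(5, 10, 20, 40)) {
  val m = KMeans.train(extractedFeatureVector, k, 100)
  println(s"k=$k WSSSE=${m.computeCost(extractedFeatureVector)}")
}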

As a first-order measure, you can train word2vec and then look at the distances between a few hundred pairs of similar words, such as 'genes' and 'chromosomes' or 'molecule' and 'ion'. If the distances are low, then it's time to start thinking about how to automate the clustering.
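
Here is a sketch of that check, computing cosine distance over the exported abs_terms.txt from the question (the word pairs are just examples; adapt them to your vocabulary):

import scala.io.Source

// Load word -> vector from the exported file (token 0 is the word)
val vecs: Map[String, Array[Double]] = Source.fromFile("abs_terms.txt").getLines()
  .map { line =>
    val t = line.split(' ')
    t.head -> t.tail.map(_.toDouble)
  }.toMap

// Cosine distance: 0 means identical direction, 2 means opposite
def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  1.0 - dot / norm
}

for ((w1, w2) <- Seq(("genes", "chromosomes"), ("molecule", "ion"))
     if vecs.contains(w1) && vecs.contains(w2))
  println(s"$w1 vs $w2: ${cosineDistance(vecs(w1), vecs(w2))}")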