In general, LSA is meaningful for computing document similarity. However, you need a large collection of documents (more than 100,000), because LSA works by finding associations between words (e.g. it will discover that "dog" and "cat" are related, and therefore that a document about dogs is similar to a document about cats). If your collection is small, no meaningful associations between words can be derived. LSA is just a change of representation: to compute similarity you still use the cosine, only on the LSA representation instead of the original one. Originally each document is a sparse vector of dimension e.g. 100,000; after LSA it is a dense vector of dimension e.g. 200.
As you said, you can already compute cosine similarity on the sparse data (just transformed word counts). Hopefully you have already applied stop-word removal, stemming, and tf-idf normalization. It's useful to know what these transformations achieve, because LSA is just another transformation on top of those standard ones; I'll briefly go over their usefulness before describing what LSA does.
If you did all of those, you already have a very strong similarity measure.
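For reference, here is a sketch of those standard transformations, assuming NLTK for the stop list and stemmer (nltk.download('stopwords') is needed once, and get_feature_names_out assumes a recent scikit-learn):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def preprocess(doc):
    # drop stop words, collapse inflected forms onto a common stem
    return ' '.join(stemmer.stem(w) for w in doc.lower().split() if w not in stop)

docs = ["The dogs were barking", "A dog barks at the cats"]
tfidf = TfidfVectorizer()                  # tf-idf normalization on top
X = tfidf.fit_transform(preprocess(d) for d in docs)
print(tfidf.get_feature_names_out())       # "dogs"/"dog" and "barking"/"barks" now share dimensions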
One more transformation you can do is to consider related words: for example, dog, hound, animal, cat, etc. are related. Some of these relationships will be meaningful for your similarity score, while others will not. One way to describe LSA is that it first derives such relationships, through statistical co-occurrence analysis. For example, cats and dogs co-occur frequently, so a document about dogs will be more similar to a document about cats than to a document about tomatoes. However, sometimes this may not be what you want: Austria and Germany are similar, so if you search for Germany you may get documents about Austria.
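A sketch of how such relations can be read off the SVD: the rows of V·Σ are dense term vectors, and the cosine between them reflects co-occurrence (toy data; on a real collection "dog" would land near "cat"):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["dog barks cat meows", "cat chases dog", "tomato sauce recipe", "grow tomato plants"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

svd = TruncatedSVD(n_components=2).fit(X)
term_vectors = svd.components_.T * svd.singular_values_   # one dense row per term (V * Sigma)
terms = vec.get_feature_names_out()

i = list(terms).index('dog')
sims = cosine_similarity(term_vectors[i:i + 1], term_vectors)[0]
print(sorted(zip(sims, terms), reverse=True)[:3])          # nearest terms to "dog"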
In general, LSA will make sense if you want to compare documents based on their topics.
What LSA does is to "enrich" or "expand" each document with related words. One way to achieve this is to just use the LSA representation: the LSA representation of a document is simply the sum of the LSA representations of its individual words.
Another, more controlled way to perform document "expansion" while still using LSA is to take the related words that LSA derives for each word in the document and selectively add some of them to the original document, as in the sketch below.
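A hedged sketch of that selective expansion, reusing term_vectors and terms from the previous snippet (the neighbour count and the threshold are arbitrary choices here, not part of LSA itself):

from sklearn.metrics.pairwise import cosine_similarity

def expand(doc_words, term_vectors, terms, top_n=2, threshold=0.5):
    index = {t: i for i, t in enumerate(terms)}
    expanded = list(doc_words)
    for w in doc_words:
        if w not in index:
            continue                        # out-of-vocabulary words cannot be expanded
        sims = cosine_similarity(term_vectors[index[w]:index[w] + 1], term_vectors)[0]
        neighbours = sorted(zip(sims, terms), reverse=True)[1:top_n + 1]  # skip the word itself
        expanded += [t for s, t in neighbours if s > threshold]
    return expanded

print(expand(['dog', 'barks'], term_vectors, terms))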
LSA will not work equally well for all documents or all words. Documents whose topics dominate in the collection will likely be improved, while "outlier" documents will effectively be mapped to noise. Rare words will not have meaningful similar words, whether or not LSA is used.
Another drawback of LSA is that its representation is dense, while the original one is sparse. This means that you will not be able to find the most similar documents quickly using a search engine's inverted index. However, LSA can easily be applied in a rescoring phase, after candidates are found with keyword search (sketched below). LSA can also be used to derive informative words from a document.
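A sketch of that two-phase setup; keyword_search is a hypothetical stand-in for whatever inverted-index engine you use:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rescore(query_lsa, candidate_ids, docs_lsa, top_k=10):
    # cosine on the dense LSA vectors, computed only over the candidate set
    sims = cosine_similarity(query_lsa.reshape(1, -1), docs_lsa[candidate_ids])[0]
    order = np.argsort(-sims)
    return [(candidate_ids[j], sims[j]) for j in order[:top_k]]

# hypothetical usage: fast sparse retrieval first, dense rescoring second
# candidates = keyword_search(query)             # ids from an inverted index
# print(rescore(query_lsa, candidates, X_lsa))   # X_lsa as in the first sketch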
Here is an example. First copy matutils.py and utils.py from gensim into your working directory, so that the directory looks like this:

doc_similar.py
matutils.py
utils.py
data/     <- put your data file here
model/    <- created automatically

The code below goes in doc_similar.py. Then move your data file into the data directory and change fname in the main function.
#coding:utf-8
from gensim import corpora, models, similarities
import cPickle
import logging
import os
import sys            # used for sys.argv in the __main__ block
import numpy as np
import scipy.sparse   # import the submodule explicitly for scipy.sparse.issparse
import utils
import matutils
from collections import defaultdict
data_dir = os.path.join(os.getcwd(), 'data')
# os.path.splitext is safer than rstrip('.py'), which strips any trailing '.', 'p' or 'y'
work_dir = os.path.join(os.getcwd(), 'model', os.path.splitext(os.path.basename(__file__))[0])
if not os.path.exists(work_dir):
    os.makedirs(work_dir)   # also creates the intermediate 'model' directory
os.chdir(work_dir)
logger = logging.getLogger('text_similar')
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# convert to unicode
def to_unicode(text):
if not isinstance(text, unicode):
text = text.decode('utf-8')
return text
class TextSimilar(utils.SaveLoad):
def __init__(self):
self.conf = {}
    def _preprocess(self):
        # each input line is an id followed by the document's words; drop the id
        docs = [to_unicode(doc.strip()).split()[1:] for doc in open(self.fname)]
cPickle.dump(docs, open(self.conf['fname_docs'], 'wb'))
dictionary = corpora.Dictionary(docs)
dictionary.save(self.conf['fname_dict'])
corpus = [dictionary.doc2bow(doc) for doc in docs]
corpora.MmCorpus.serialize(self.conf['fname_corpus'], corpus)
return docs, dictionary, corpus
def _generate_conf(self):
        fname = os.path.basename(self.fname)
self.conf['fname_docs'] = '%s.docs' % fname
self.conf['fname_dict'] = '%s.dict' % fname
self.conf['fname_corpus'] = '%s.mm' % fname
def train(self, fname, is_pre=True, method='lsi', **params):
self.fname = fname
self.method = method
self._generate_conf()
if is_pre:
self.docs, self.dictionary, corpus = self._preprocess()
else:
self.docs = cPickle.load(open(self.conf['fname_docs']))
self.dictionary = corpora.Dictionary.load(self.conf['fname_dict'])
corpus = corpora.MmCorpus(self.conf['fname_corpus'])
logger.info("training TF-IDF model")
self.tfidf = models.TfidfModel(corpus, id2word=self.dictionary)
corpus_tfidf = self.tfidf[corpus]
if method == 'lsi':
logger.info("training LSI model")
self.lsi = models.LsiModel(corpus_tfidf, id2word=self.dictionary, **params)
self.similar_index = similarities.MatrixSimilarity(self.lsi[corpus_tfidf])
self.para = self.lsi[corpus_tfidf]
elif method == 'lda_tfidf':
logger.info("training LDA model")
self.lda = models.LdaMulticore(corpus_tfidf, id2word=self.dictionary, workers=8, **params)
self.similar_index = similarities.MatrixSimilarity(self.lda[corpus_tfidf])
self.para = self.lda[corpus_tfidf]
elif method == 'lda':
logger.info("training LDA model")
self.lda = models.LdaMulticore(corpus, id2word=self.dictionary, workers=8, **params)
self.similar_index = similarities.MatrixSimilarity(self.lda[corpus])
self.para = self.lda[corpus]
elif method == 'logentropy':
logger.info("training a log-entropy model")
self.logent = models.LogEntropyModel(corpus, id2word=self.dictionary)
self.similar_index = similarities.MatrixSimilarity(self.logent[corpus])
self.para = self.logent[corpus]
else:
msg = "unknown semantic method %s" % method
logger.error(msg)
raise NotImplementedError(msg)
def doc2vec(self, doc):
bow = self.dictionary.doc2bow(to_unicode(doc).split())
if self.method == 'lsi':
return self.lsi[self.tfidf[bow]]
elif self.method == 'lda':
return self.lda[bow]
elif self.method == 'lda_tfidf':
return self.lda[self.tfidf[bow]]
elif self.method == 'logentropy':
return self.logent[bow]
def find_similar(self, doc, n=10):
vec = self.doc2vec(doc)
sims = self.similar_index[vec]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for elem in sims[:n]:
idx, value = elem
print ' '.join(self.docs[idx]), value
def get_vectors(self):
return self._get_vector(self.para)
def _get_vector(self, corpus):
def get_max_id():
maxid = -1
for document in corpus:
maxid = max(maxid, max([-1] + [fieldid for fieldid, _ in document])) # [-1] to avoid exceptions from max(empty)
return maxid
num_features = 1 + get_max_id()
index = np.empty(shape=(len(corpus), num_features), dtype=np.float32)
for docno, vector in enumerate(corpus):
if docno % 1000 == 0:
print("PROGRESS: at document #%i/%i" % (docno, len(corpus)))
if isinstance(vector, np.ndarray):
pass
elif scipy.sparse.issparse(vector):
vector = vector.toarray().flatten()
else:
vector = matutils.unitvec(matutils.sparse2full(vector, num_features))
index[docno] = vector
return index
def cluster(vectors, ts, k=30):
from sklearn.cluster import k_means
X = np.array(vectors)
    cluster_center, result, inertia = k_means(X.astype(np.float64), n_clusters=k, init="k-means++")
X_Y_dic = defaultdict(set)
for i, pred_y in enumerate(result):
X_Y_dic[pred_y].add(''.join(ts.docs[i]))
print 'len(X_Y_dic): ', len(X_Y_dic)
    with open(data_dir + '/cluster.txt', 'w') as fo:
for Y in X_Y_dic:
fo.write(str(Y) + '\n')
fo.write('{word}\n'.format(word='\n'.join(list(X_Y_dic[Y])[:100])))
def main(is_train=True):
fname = data_dir + '/brand'
num_topics = 100
method = 'lda'
ts = TextSimilar()
if is_train:
        ts.train(fname, method=method, num_topics=num_topics, is_pre=True, iterations=100)
ts.save(method)
else:
        ts = TextSimilar.load(method)   # SaveLoad.load is a classmethod
index = ts.get_vectors()
cluster(index, ts, k=num_topics)
if __name__ == '__main__':
    is_train = len(sys.argv) > 1
main(is_train)
Best Answer
Yes, you can use it. The problem is that cosine similarity is not a distance, which is why it is called a similarity. Nevertheless, it can be converted to a distance, as explained here.
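For instance, a common conversion is distance = 1 - similarity. Note that 1 - cosine is not a true metric (it violates the triangle inequality), while sqrt(2 * (1 - cosine)) is one for unit-length vectors, since it equals their Euclidean distance. A plain-NumPy sketch:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    # a dissimilarity, not a metric
    return 1.0 - cosine_similarity(a, b)

a, b = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(cosine_distance(a, b))                         # ~0.293
print(np.sqrt(2 * (1 - cosine_similarity(a, b))))    # Euclidean distance after normalizing both vectors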
In fact, you can just use any distance. A very nice study of the properties of distance functions in high-dimensional spaces (as is usually the case in information retrieval) is On the Surprising Behavior of Distance Metrics in High Dimensional Space. It does not compare Euclidean vs. cosine, though.
I also came across this study, which claims that in high-dimensional spaces both distances tend to behave similarly.