Solved – Euclidean Distance b/t unit vectors or cosine similarity where vectors are document vectors

clusteringdistance-functionsinformation retrievalsimilaritiestext mining

I was reading Similarity Measures and suddenly my whole world was falling apart. I have implemented a search engine using clustering techniques. For clustering, I used k means which uses Euclidean distance as its distance measure. I also used cosine similarity to display results. I was getting amazingly accurate results. But now that I read this, what I did was normalize the document vectors and calculated the euclidean distance between two resulting unit vectors and hence I have not considered magnitude anywhere.

Am I doing something wrong?

Although I think that a higher term frequency would make up for a higher tf-idf value and a higher normalized tf-idf value and hence would be appropriately ranked high.

Results (not using normalized vectors, the figures are euclidean distances)

61.79689257425985 222Proposed Research Details.doc
144.15451315901478 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_
                      alternate_sustainable_land_management_options_for_tribal_
                      dominated_watersheds_RRPS_24.doc
72.61392308146608 done_Developing live fencing systems for soil & water conservation_
                      NATIP-RNPS-3 SKN Math).doc
72.96125277156261 done_Management strategies for impriing rabi (SKN Math).doc
65.51734241367222 done_RPFIII_dr.dogra.doc
66.72042766100921 Evaluation of crops and their varieties (SKN Math).doc
418.8868087170988 P. VIJAYA KUMAR (DSS).doc
140.3914521621597 RPF - I PIMS-ICAR project proposal for IASRI.doc
72.95414421468679 RPF-III__Indo-US_project.doc
82.25126123574397 220Introduction and objectives.doc

Results (with normalized vectors, the figures are euclidean distances)

1.3435369899385359 222Proposed Research Details.doc
1.1277471087250086 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_
                       alternate_sustainable_land_management_options_for_tribal_
                       dominated_watersheds_RRPS_24.doc
1.2741267093494966 done_Developing live fencing systems for soil & water conservation_
                       NATIP-RNPS-3 SKN Math).doc
1.264154265747389 done_Management strategies for impriing rabi (SKN Math).doc
1.2902191708899362 done_RPFIII_dr.dogra.doc
1.3128744973475515 Evaluation of crops and their varieties (SKN Math).doc
0.4924243033927417 P. VIJAYA KUMAR (DSS).doc
1.1747048933792805 RPF - I PIMS-ICAR project proposal for IASRI.doc
1.29150899172647 RPF-III__Indo-US_project.doc
1.318016051789028 220Introduction and objectives.doc

Results (figures are cosine similarity)

0.09745417833344654 222Proposed Research Details.doc
0.36409322938119104 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_
                       alternate_sustainable_land_management_options_for_tribal_
                       dominated_watersheds_RRPS_24.doc
0.1883005642611103 done_Developing live fencing systems for soil & water conservation_
                       NATIP-RNPS-3 SKN Math).doc
0.2009569961963377 done_Management strategies for impriing rabi (SKN Math).doc
0.16766724553404047 done_RPFIII_dr.dogra.doc
0.13818027710720598 Evaluation of crops and their varieties (SKN Math).doc
0.8787591527140649 P. VIJAYA KUMAR (DSS).doc
0.3100342067353838 RPF - I PIMS-ICAR project proposal for IASRI.doc
0.16600226214483405 RPF-III__Indo-US_project.doc
0.13141684361322944 220Introduction and objectives.doc

The results 1 and 2 do not agree with each other while 2 and 3 strongly do. More similarity, lesser distance. The distances are taken between cluster centroid vector and the document vectors of each of the document.

In fact, the weirdest result is the document with a euclidean distance of 418 and having the most similarity of 0.87. while normalized distance becomes 0.49 and agrees with similarity.

Best Answer

Do not compare the raw numbers.

An Euclidean distance of $x$ and a cosine similarity of $y$ and Euclidean distance on normalized vectors of $z$ live on different scales and just cannot be compared. This is like comparing "shoe size" with "mass in lb". There probably is some relationship, but none that you can see by looking at a small sample.

k-means optimizes Euclidean distances (and it shouldn't really be used with other distances, at the mean may not be a least squares estimator of the central object). So the clusters you computed already favor Euclidean neighbors. If you then refine your search using Cosine distance, it will likely be only within the subset that was already Euclidean-similar.

So a lot of the things you observe probably are artifacts from the clustering preprocessing you did and similar aspects of your code.

Note that when using non-normalized vectors, you essentially put a high weight on the document size. This is why TF-IDF is commonly used: a document and the concatenation of two copies of the same document then have the same vector.

Related Question