Solved – Cosine similarity indexing

cosine similarity

Are there any open source implementations out there that can efficiently solve the following.

  1. I'm given a fixed set $S$ of $n$-dimensional vectors of size $N$, where $N$ is of the order of a million.

  2. Given an n-dimensional vector $v$, I want to find the top $K$ vectors $w_1,\ldots,w_k$ from $S$, such that the cosine similarity of $v$ and each $w_i$ is maximized.

Here $K$ should be a parameter that I can choose at query time. I know there are various metric data structures that can be used for queries such as these. There's also a paper from Google from last NIPS, where they do this. For example this paper from Google uses an internal library for it:

http://papers.nips.cc/paper/7157-multiscale-quantization-for-fast-similarity-search

Best Answer

If your vectors are dense, you can use a map-reduce code for matrix multiplication (its the best which can be said with the limited information you have provided).

If the vectors are sparse, you can do much better. See

http://www.jmlr.org/papers/volume14/bosagh-zadeh13a/bosagh-zadeh13a.pdf

The authors provide code for their work

Related Question