Solved – how to find the nearest neighbor of a sparse vector

algorithms, information retrieval, k nearest neighbour, machine learning

I have about 500 vectors, each of 1500 dimensions, and almost every vector is very sparse – only about 30-70 of its dimensions are non-zero.

Now, the problem: given a query vector, also 1500-dimensional, I need to compare it to the 500 vectors and find which of the 500 is the nearest one (in Euclidean distance).

There is no doubt that the brute-force method is a solution, but it means computing the distance 500 times, which takes a long time.
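For concreteness, the brute force I mean is just something like this (a minimal sketch with made-up data, assuming NumPy/SciPy are available):

```python
import numpy as np
from scipy import sparse

# Made-up data in the same shape as my problem: 500 sparse vectors of 1500 dims.
X = sparse.random(500, 1500, density=0.03, format="csr", random_state=0)
q = sparse.random(1, 1500, density=0.03, format="csr", random_state=1)

# Brute force: compute all 500 Euclidean distances and take the smallest.
diff = X.toarray() - q.toarray()      # densify just for clarity
dist = np.linalg.norm(diff, axis=1)
nearest = int(np.argmin(dist))
print(nearest, dist[nearest])
```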

Yesterday I read the paper “Object retrieval with large vocabularies and fast spatial matching”, which says an inverted index will help.
But in my test it made almost no difference: with 1500-dimensional vectors in which about 50 dimensions are non-zero, any two vectors almost always share at least one non-zero dimension. In other words, the algorithm rules out only a few vectors, and I still have to compare against most of the remaining ones.
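Roughly, the pruning I tested looks like this (my own sketch, not the paper's code): one posting list per dimension, and only vectors that share a non-zero dimension with the query are kept as candidates.

```python
from collections import defaultdict

def build_inverted_index(X_csc):
    """Posting lists: dimension -> ids of vectors that are non-zero there.
    X_csc is a scipy.sparse CSC matrix of shape (n_vectors, n_dims)."""
    index = defaultdict(list)
    for dim in range(X_csc.shape[1]):
        start, end = X_csc.indptr[dim], X_csc.indptr[dim + 1]
        index[dim].extend(X_csc.indices[start:end].tolist())
    return index

def candidate_ids(index, q_csr):
    """Union of posting lists over the query's non-zero dimensions."""
    cand = set()
    for dim in q_csr.indices:
        cand.update(index.get(dim, ()))
    return cand   # in my data this set ends up being almost all 500 vectors
```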

Thank you for being kind enough to read this far. My questions are:

  1. Does this inverted-index algorithm actually help in my case?

  2. Is there another way to do this, such as FLANN or a k-d tree? I need the exact nearest neighbor; an approximate one is not enough.

Best Answer

An inverted index is really useful when you use a cosine distance, since you then have to compute the dot product of, say, x and y, which is simply: $$ \langle x, y \rangle = \sum_{i=1}^{N} x_i \, y_i $$ so you only need to compare the document x against documents that have non-zero coordinates in the dimensions where $x_i \neq 0$. I think your article was talking about that.
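A rough sketch of what that looks like in code (my own notation, not the paper's): store (doc_id, value) pairs per dimension and accumulate dot products only over the query's non-zero dimensions.

```python
from collections import defaultdict

def build_index(vectors):
    """vectors: dict doc_id -> {dim: value}. Postings hold (doc_id, value) pairs."""
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for dim, value in vec.items():
            index[dim].append((doc_id, value))
    return index

def dot_products(index, query):
    """Accumulate <x, q> for every document sharing a non-zero dim with q.
    Documents missing from the result have a dot product of exactly 0."""
    scores = defaultdict(float)
    for dim, q_val in query.items():
        for doc_id, x_val in index.get(dim, ()):
            scores[doc_id] += x_val * q_val
    return scores
```

For cosine similarity you would then divide each score by the two vectors' norms, which can be precomputed.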

Unfortunately there's no such "easy trick" for Euclidean distance, which uses $$ \|x-y\|^2 = \langle x-y, x-y \rangle $$ instead. There are many options if an approximate nearest neighbour (ANN) is enough, but if you truly want the exact nearest one it can get ugly. Most ANN methods will usually return the closest one anyway; the simplest is LSH for partitioning, for example, or an LSH forest.
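As an illustration of the LSH idea (a sketch only; this is the random-hyperplane variant, which targets angular/cosine similarity, whereas p-stable LSH would be the usual choice for Euclidean distance): hash each vector to a short bit signature and only check vectors that land in the same bucket.

```python
import numpy as np

def lsh_buckets(X, n_bits=16, seed=0):
    """Random-hyperplane LSH: sign of n_bits random projections as the hash key.
    X is a dense (n_vectors, n_dims) array."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes) > 0
    buckets = {}
    for i, row in enumerate(bits):
        key = bytes(np.packbits(row))
        buckets.setdefault(key, []).append(i)
    return planes, buckets

def query_candidates(planes, buckets, q):
    """Vectors hashed into the same bucket as q; check these exactly afterwards."""
    key = bytes(np.packbits(q @ planes > 0))
    return buckets.get(key, [])
```

Candidates from the same bucket still need an exact distance check, and with a single table the true nearest neighbour can be missed, which is exactly the approximation mentioned above.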

Note also that k-d trees work great in low dimensions, but their query time gets close to linear as the dimension of the space increases, so you end up with brute-force complexity anyway. I should double-check, but I recall a paper showing that partitioning methods tend to become linear as the dimension grows.
