Solved – Doing low-dimensional KNN on a large dataset

k nearest neighbour

I have a simple two-dimensional dataset with columns X1,X2, and [outcome], and I want to try KNN (probably K around 100 or 1000, though ideally CV would be possible). The problem is that my dataset has a couple million rows.

Are there any preferred packages/approaches for dealing with this sort of thing?

Best Answer

A naive nearest neighbor implementation has to compute the distance between your test example and every instance in the training set. This $O(n)$ scan per query can be problematic if you have a lot of data.
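As a baseline, here is a minimal pure-Python sketch of that naive scan (the function name and toy data are mine, not from the question): it sorts all training points by distance to the query and takes a majority vote over the $k$ closest.

```python
import math

def knn_predict(train, labels, query, k=3):
    """Naive k-NN: measure the distance from the query to every
    training point (the O(n) scan), then vote among the k closest."""
    dists = sorted(
        (math.dist(p, query), y) for p, y in zip(train, labels)
    )
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy 2-D data: two clusters with labels 0 and 1.
train = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels = [0, 0, 1, 1]
print(knn_predict(train, labels, (0.2, 0.2), k=3))  # → 0
```

With a couple million rows, this scan is exactly the cost the rest of the answer is trying to avoid.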

One solution is to find a more efficient representation of the training data. "Space-partitioning" data structures organize points in a way that makes it possible to search through them efficiently. Using a $k$-d tree, one can find a point's nearest neighbor in $O(\log n)$ time instead, which is a substantial speed-up. $k$-d trees work best in low-dimensional spaces, so your two-dimensional data is an ideal case for them.
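A compact pure-Python sketch of the idea (function names and example points are mine): build the tree by recursively splitting on alternating axes, then search it while pruning any subtree that cannot contain a closer point than the best found so far.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree over 2-D points.
    Each node is (point, left_subtree, right_subtree)."""
    if not points:
        return None
    axis = depth % 2                      # alternate splitting axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Find the nearest neighbor of `query`, descending toward the
    query's side first and pruning the far side when possible."""
    if node is None:
        return best
    point, left, right = node
    if best is None or math.dist(point, query) < math.dist(best, query):
        best = point
    axis = depth % 2
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Only search the far side if the splitting plane is closer
    # to the query than the best distance found so far.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, depth + 1, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))  # → (8, 1)
```

In practice you would not hand-roll this: scikit-learn's `KNeighborsClassifier` accepts `algorithm="kd_tree"` and handles the tree construction and $k$-neighbor queries (including the $k \approx 100$–$1000$ range you mention) for you.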

There are also approximate nearest neighbor algorithms, such as locality-sensitive hashing or the best-bin-first algorithm. Their results are approximations: they don't always find the exact nearest neighbor, but they often find something very close to it (which is probably just as good for classification).
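To make the locality-sensitive-hashing idea concrete, here is a simplified random-projection sketch (the scheme, function names, and parameters like `n_planes` and `cell` are my illustrative choices, not a production LSH): nearby points tend to land in the same bucket, so a query only scans the points in its own bucket instead of the whole dataset.

```python
import math
import random

def lsh_buckets(points, n_planes=4, cell=1.0, seed=0):
    """Hash 2-D points into buckets via random projections:
    each plane maps a point to a quantized scalar, and the tuple
    of quantized values is the bucket key."""
    rng = random.Random(seed)
    planes = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.uniform(0, cell))
              for _ in range(n_planes)]
    buckets = {}
    for p in points:
        key = tuple(math.floor((a * p[0] + b * p[1] + off) / cell)
                    for a, b, off in planes)
        buckets.setdefault(key, []).append(p)
    return planes, buckets

def approx_nearest(planes, buckets, query, cell=1.0):
    """Scan only the query's bucket -- fast, but may miss the
    true nearest neighbor (that's the 'approximate' part)."""
    key = tuple(math.floor((a * query[0] + b * query[1] + off) / cell)
                for a, b, off in planes)
    candidates = buckets.get(key, [])
    if not candidates:
        return None                       # empty bucket: no candidate found
    return min(candidates, key=lambda p: math.dist(p, query))
```

Real LSH implementations use multiple hash tables to drive down the miss probability; this sketch uses one table to keep the idea visible.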

Finally, if you've got a relatively fixed training set, you could compute something like a Voronoi diagram that indicates the neighborhoods around each point. This could then be used as a look-up table for future queries.
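A crude stand-in for that idea, assuming a fixed training set (the grid-based approximation and all names here are mine): precompute the nearest training point's label for the center of every cell in a regular grid, then answer each future query with a constant-time cell lookup.

```python
import math

def build_lookup(points, labels, x_range, y_range, step):
    """Precompute, for each grid cell center, the label of its
    nearest training point -- a coarse, tabulated approximation
    of the Voronoi neighborhoods."""
    x0, x1 = x_range
    y0, y1 = y_range
    table = {}
    for i in range(int((x1 - x0) / step)):
        for j in range(int((y1 - y0) / step)):
            center = (x0 + (i + 0.5) * step, y0 + (j + 0.5) * step)
            _, label = min(
                (math.dist(p, center), y) for p, y in zip(points, labels))
            table[(i, j)] = label
    return table

def query_lookup(table, x_range, y_range, step, q):
    """Classify a query in O(1) by indexing its grid cell."""
    i = int((q[0] - x_range[0]) / step)
    j = int((q[1] - y_range[0]) / step)
    return table[(i, j)]

table = build_lookup([(1, 1), (9, 9)], [0, 1], (0, 10), (0, 10), 1.0)
print(query_lookup(table, (0, 10), (0, 10), 1.0, (2, 2)))  # → 0
```

The trade-off is up-front work and memory proportional to the grid resolution, in exchange for constant-time queries afterward; with only two dimensions, the grid stays manageable.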
