Solved – Dealing with lots of ties in kNN model

k-nearest-neighbour

I have a large data set (400k rows × 60 columns) that I'm trying to use to build a kNN model. I'm using the caret package version of kNN and the forward.search method from the FSelector package to eliminate variables via cross-validation. My problem is that once I use more than 20k rows of data, I get a message about there being too many ties.

Currently I'm only checking k-values between 1 and 19 (and only odd values, since they supposedly reduce the risk of ties) and only using variables with more than 2 levels.

Are there any other tweaks for feeding large chunks of data into a kNN model?

EDIT: This is a regression problem, not a classification problem.

Best Answer

In some situations you have many data items that might be considered tied in distance, especially if your data is discrete (e.g. your matrix is made up of integers).

A "hack" that might work is to add a very small amount of pseudo-random noise to the data. This reduces the number of data items that happen to be equidistant. Note that the noise should be small enough not to bias the results, but large enough to break the ties.
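A minimal sketch of this hack in R, on made-up integer data (the matrix, seed, and noise scale are illustrative assumptions, not from the original post). The noise half-width is kept orders of magnitude smaller than the grid spacing of the data:

```r
# Break distance ties by adding tiny uniform noise to discrete data.
set.seed(42)

# Hypothetical discrete data: 100 points on a small integer grid,
# which guarantees many duplicated pairwise distances.
x <- matrix(sample(1:5, 200, replace = TRUE), ncol = 2)

# Add uniform noise far smaller than the spacing between distinct
# values (here the spacing is 1, the noise half-width is 1e-6).
x_jittered <- x + matrix(runif(length(x), -1e-6, 1e-6), nrow = nrow(x))
# Base R's jitter() does the same thing column-wise:
# x_jittered <- apply(x, 2, jitter, amount = 1e-6)

# Count tied (duplicated) pairwise distances before and after.
n_ties <- function(m) {
  d <- as.vector(dist(m))
  length(d) - length(unique(d))
}
n_ties(x)           # many duplicated distances on the integer grid
n_ties(x_jittered)  # far fewer (typically none) after jittering
```

Because the perturbation is tiny relative to the data's scale, the fitted regression surface is essentially unchanged while the tie warnings go away.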